How to Standardise and Assemble Raw Data into Sequences: What Does it Mean for a Laboratory to Use Such Technologies? Dr Joseph Hughes 11th OIE Seminar Saskatoon - 17th June 2015


How to Standardise and Assemble Raw Data into Sequences: What Does it Mean for a Laboratory to Use Such Technologies?

Dr Joseph Hughes
11th OIE Seminar

Saskatoon - 17th June 2015

Decreasing sequencing cost

[Figure: cost per raw megabase of DNA sequence, log scale from $0.01 to $10,000.00; x-axis from Jul-98 to Sep-17. Source: http://www.genome.gov/sequencingcosts]

Democratization of sequencing: http://omicsmaps.com

Applications of high-throughput sequencing

•  Whole genome sequencing
•  Genome variability within a host
•  De-novo assembly of novel viruses
•  Metagenomics of communities

Considerations for a genome assembly pipeline

•  Flexible pipeline: handles unknown genotypes or virus samples
•  Platform independent: works with data from different platforms
•  Virus independent: works on any virus
•  Scalable to hundreds or thousands of samples
•  Accuracy of SNP calling in the genome (outbreak analysis where samples are more closely related)

[Flow diagram with two entry points, known reference and unknown reference, running through three stages]

Pre-assembly processing
•  Check format (sff, fastq)
•  Convert to FASTQ
•  Remove adaptor contaminants
•  Remove host genome contamination
•  Quality & length trimming

Assembly
•  Reference assembly
•  De-novo assembly
•  Contig merge
•  Scaffolding contigs

Post-assembly processing
•  Validation
•  Consensus
•  Variant calling
•  Classification
•  Annotation
•  Genome comparison
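The quality and length trimming step of the pre-assembly stage can be illustrated with a minimal sketch. It assumes Phred+33 encoded quality strings and trims from the 3' end only; the function name, the quality threshold of 20 and the minimum length are illustrative assumptions, not the settings of any tool named in this talk.

```python
def trim_read(seq, qual, min_q=20, min_len=30, offset=33):
    """Trim low-quality bases from the 3' end of one read.

    Returns (seq, qual) after trimming, or None if the trimmed read is
    shorter than min_len. All thresholds are illustrative assumptions.
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below min_q.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    if end < min_len:
        return None  # read too short to keep after trimming
    return seq[:end], qual[:end]


# Hypothetical read: under Phred+33, 'I' is Q40 and '#' is Q2.
print(trim_read("ACGTACGTAC", "IIIIIIII##", min_len=5))
```

A read that falls below the minimum length is dropped rather than kept as a fragment, which is the usual reason length and quality trimming are done in the same step.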

Examples

1.  1999-2001 in Northern Italy: emergence of highly pathogenic avian influenza H7N1
•  Identify known molecular markers for viral pathogenicity in intra-host viral populations
•  OIE & FAO reference lab for Influenza

2.  2010 in the Netherlands: die-off of >1000 wild water frogs and newts
•  Isolation, characterisation and relationship to known viruses of the Dutch frog killer
•  Van Beurden et al. (2014). Genome Announc.

[Image: hybrid edible frog (Pelophylax kl. esculentus)]

Example 1: Characterization of HPAI signature mutations

Monne et al. (2014). Journal of Virology

Pre-assembly processing

trim_galore and FastQC for quality control

Reference assemblers?

•  Hash-based tools: Mosaik, Novoalign, Stampy, Tanoti
•  Burrows-Wheeler Transform-based tools: BWA, Bowtie2, NextGenMap

Too many to choose from

http://www.bioinformatics.cvr.ac.uk/Tanoti
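To make the second family of tools more concrete, here is a toy Burrows-Wheeler transform built by sorting all rotations of a sequence. It only illustrates the transform itself; real aligners build it through far more efficient index structures, and the `$` terminator and the function names are assumptions of this sketch.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform by sorting rotations (toy version).

    Assumes the terminator does not occur in text.
    """
    text = text + terminator
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation table.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


encoded = bwt("GATTACA")
print(encoded, inverse_bwt(encoded))
```

The transform is reversible and tends to group identical characters together, which is what makes compact, searchable genome indexes possible.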

[Figure: log10 depth of coverage (DOC) against position for the eight segments HA, M, NA, NP, NS, PA, PB1 and PB2. Legend: Bowtie2 and Stampy; Tanoti]
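The quantity plotted in the coverage panels, log10 of the depth of coverage at each position, can be computed from alignment intervals as in the sketch below. The intervals are hypothetical, coordinates are assumed to be 0-based with an exclusive end, and positions with no coverage are reported as None because log10(0) is undefined.

```python
import math


def depth_of_coverage(alignments, ref_len):
    """Per-position depth from (start, end) intervals, end exclusive.

    Assumes 0 <= start < end <= ref_len for every interval.
    """
    # Difference array: +1 where a read starts, -1 just past its end.
    diff = [0] * (ref_len + 1)
    for start, end in alignments:
        diff[start] += 1
        diff[end] -= 1
    depth, running = [], 0
    for i in range(ref_len):
        running += diff[i]
        depth.append(running)
    return depth


def log10_depth(depth):
    """log10 of each depth value; None where the depth is zero."""
    return [math.log10(d) if d > 0 else None for d in depth]


# Three hypothetical reads aligned to a 10 bp reference.
depth = depth_of_coverage([(0, 5), (2, 8), (4, 10)], ref_len=10)
print(depth)
print(log10_depth(depth))
```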


Tablet - assembly

Variant calling – detecting true mutations

•  Many tools: LoFreq, Vphaser, DiversiTools
•  Using replicates to validate mutations (e.g. FMDV experiments)

One LPAI sample collected after the identification of HPAI with an HA cleavage site and multiple HPAI-associated mutations at extremely low frequency

[Figure: two panels, "Frequency of LPAI in HPAI samples" and "Frequency of HPAI in LPAI samples". Axes: amino acid changes against samples.
Amino acid changes in both panels: PB2_I398T, PB1_D154G, PB1_G216S, PB1_E745K, PA_T61I, PA_K115N, PA_K252E, HA_A130T, HA_T146A, HA_E228A, HA_T454A, HA_R554K, NP_A349T, NP_N376S, NA_K173R, M1_A166V, NS1_I136V, NS1_N139D, NS1_-225R.
Sample labels, first group: X4756.99, X4827.99, X4828.99, X4911.99, X4708.99, X4618.99, X4618.99.1, X4749.99.
Sample labels, second group: X4295.99, X3675.99, X4829.99, X1744.99, X2732.99, X3283.99]
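The idea of using replicates to separate true low-frequency mutations from noise can be sketched as follows. This is not how LoFreq, Vphaser or DiversiTools work internally; the base counts are hypothetical and the 1% frequency threshold is an assumption chosen only for illustration.

```python
def variant_frequencies(base_counts, reference_base):
    """Frequency of each non-reference base at one position."""
    total = sum(base_counts.values())
    if total == 0:
        return {}
    return {
        base: count / total
        for base, count in base_counts.items()
        if base != reference_base and count > 0
    }


def validated_in_replicates(rep1, rep2, reference_base, min_freq=0.01):
    """Keep only variants at or above min_freq in both replicates."""
    freq1 = variant_frequencies(rep1, reference_base)
    freq2 = variant_frequencies(rep2, reference_base)
    return sorted(
        base for base in freq1
        if freq1[base] >= min_freq and freq2.get(base, 0.0) >= min_freq
    )


# Hypothetical base counts at one position in two replicate runs.
rep1 = {"A": 970, "G": 25, "T": 5}
rep2 = {"A": 980, "G": 20, "T": 0}
print(validated_in_replicates(rep1, rep2, reference_base="A"))
```

In this toy case the G variant is supported by both replicates, while the T seen at 0.5% in a single run is discarded as likely error.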

Example 2: Isolation and Sequencing

•  From dead wild water frog in September 2013
•  Suspension from pooled internal organs
•  Inoculated on BF-2 cells (bluegill fry fibroblast cells)
•  DNA extracted using DNeasy kit (viral purity of 67%)
•  DNA sheared by sonication
•  KAPA library preparation
•  MiSeq (Illumina) Machine #2 test run: total run 26,700,000 reads including 50% PhiX (16 Gb)
•  13,127,123 paired-end 300 bp reads from the sample (7.9 Gb)
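The read counts and yields quoted for this run are consistent with each other if every counted "read" is a pair of 300 bp reads. That reading is an interpretation rather than something the slide states, and the short check below works under it.

```python
READ_LENGTH_BP = 300
READS_PER_PAIR = 2  # assumption: each counted read is a pair


def yield_gb(n_pairs):
    """Sequenced bases in Gb for a number of read pairs."""
    return n_pairs * READS_PER_PAIR * READ_LENGTH_BP / 1e9


total_pairs = 26_700_000
sample_pairs = 13_127_123

print(round(yield_gb(total_pairs), 1))        # whole run, Gb
print(round(yield_gb(sample_pairs), 1))       # sample only, Gb
print(round(sample_pairs / total_pairs, 2))   # sample share of the run
```

The sample share comes out just under half of the run, which matches the statement that about 50% of the reads were PhiX.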

Assembly

•  Abyss-pe de-novo assembler reconstructed the full genome in a single contig of 107,260 bp
•  5 different regions had ambiguous/repetitive sequences
•  Re-sequencing ambiguous regions with Sanger

[Figure: map of the contig from position 1 to 107260 with five regions marked "?". Coordinates shown: 1, 1692, 1693, 21168, 21359, 38364, 38387, 66887, 67100, 73322, 73434, 107260]

Finishing assembly

•  CodonCode Aligner for assembling and checking the Sanger sequences
•  SequencePatcher.pl to stitch the Sanger sequences into the de-novo contig
•  iCORN2
•  Final genome: 107,260 bp => 107,772 bp
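Stitching Sanger sequences into a contig amounts to replacing coordinate ranges of the contig with confirmed sequence, which is why the genome length changes (here a net gain of 512 bp, from 107,260 to 107,772). The sketch below is not SequencePatcher.pl; the function name, the 0-based half-open coordinates and the toy sequences are all assumptions for illustration.

```python
def apply_patches(contig, patches):
    """Replace regions of a contig with confirmed sequence.

    patches is a list of (start, end, replacement) tuples with 0-based,
    end-exclusive coordinates. Patches are assumed not to overlap.
    """
    # Apply the right-most patch first so earlier coordinates stay valid.
    for start, end, replacement in sorted(patches, reverse=True):
        contig = contig[:start] + replacement + contig[end:]
    return contig


# Toy contig with two ambiguous regions written as runs of N.
contig = "AAAANNNNCCCCNNGGGG"
patches = [(4, 8, "TTTTTT"), (12, 14, "A")]
patched = apply_patches(contig, patches)
print(patched, len(patched) - len(contig))
```

Working from the 3' end backwards avoids having to recompute coordinates after each replacement changes the contig length.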

Annotating

•  BLAST to find the most similar annotated genome
•  Common Midwife Toad Virus (CMTV) from Spain
•  Transfer of annotations from CMTV to the full genome (RATT)
•  Identifies inappropriate start codons, frame-shifts
•  Correction of transferred models using Artemis
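The kind of problem flagged after annotation transfer can be illustrated with a simple check on a coding sequence. This is a sketch of the idea only, not the internal logic of RATT or Artemis; it assumes ATG as the only acceptable start codon and the standard stop codons, and the function name is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def check_cds(cds):
    """Return a list of problems found in one coding sequence."""
    cds = cds.upper()
    problems = []
    if not cds.startswith("ATG"):
        problems.append("non-ATG start codon")
    if len(cds) % 3 != 0:
        # A length that is not a multiple of 3 suggests a frame-shift.
        problems.append("length not a multiple of 3")
    # Split into complete codons; a trailing partial codon is ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal stop codon")
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("missing terminal stop codon")
    return problems


print(check_cds("ATGAAATAA"))
print(check_cds("CTGAAATA"))
```

Gene models that fail such checks are the ones that would then need manual correction in a genome browser.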

[Figure: comparison of CMTV NL with related genomes; scale bar 20 kb. Genomes shown: RGV JQ654586, STIV EU627010, FV3 KJ175144, FV3 AY548484, TFV AF389451, CGSIV KF512820, ADRV KF033124, ADRV KC865735, CMTV NL, CMTV JQ231222, ATV AY150217, EHNV FJ433873, ESV JQ724856. Values shown: 84, 95, 100, 100, 76, 100, 100, 100]

Standard formats

•  FASTQ – quality score depends on the technology and base caller
•  SAM – soon v1.5 extensions
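One practical consequence of the FASTQ point is that quality characters must be decoded with the right ASCII offset: Sanger-style and current Illumina files use an offset of 33, while some older Illumina pipelines used 64. The sketch below decodes a quality string and converts a Phred score Q to an error probability, 10^(-Q/10); the function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores."""
    return [ord(char) - offset for char in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


scores = phred_scores("I5+#")
print(scores)
print([error_probability(q) for q in scores])
```

Decoding a Phred+64 file with offset 33 inflates every score by 31, so checking the encoding belongs with the format check at the start of the pipeline.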

Genome standards – 5 categories

Ladner et al. (2014) mBio

Category (slide order)    % genome covered    HTS coverage
1                         >50%                (none given)
2                         ~80-90%             ~15-30 x
3                         ~90-99%             >100 x
4                         100%                RACE
5                         100%                ~400 – 1000 x

Challenges: Rates of increase in data

[Figure: disk storage (Mbytes/$) and DNA sequencing (bp/$) against year, 1990 to 2012, on log scales. Series: hard disk storage (MB/$), doubling time 14 months; pre-NGS (bp/$), doubling time 19 months; NGS (bp/$), doubling time 5 months. Source: http://genomebiology.com/2010/11/5/207]
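The doubling times quoted for this figure translate into growth factors through 2^(12·t/d), where t is the number of years and d the doubling time in months. The sketch below applies that formula to the three series; the function name is illustrative.

```python
def growth_factor(doubling_months, years=1.0):
    """Fold increase over a number of years for a given doubling time."""
    return 2 ** (12 * years / doubling_months)


series = [
    ("Hard disk storage (MB/$)", 14),
    ("Pre-NGS sequencing (bp/$)", 19),
    ("NGS sequencing (bp/$)", 5),
]
for label, months in series:
    print(f"{label}: x{growth_factor(months):.2f} per year")
```

Per year this works out to roughly 1.8-fold for disk storage against roughly 5.3-fold for NGS output per dollar, which is why the cost of storing data comes to outweigh the cost of generating it.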

Challenges: resources and technologies

•  Shift towards more data: labs need to have dedicated bioinformaticians
•  Rule of thumb: invest as much in computers and data scientists as in sequencing equipment and lab technicians
•  Non-uniform coverage, repeat regions, systematic biases, PCR errors, sequencing errors, sequence length

CVR bioinformatics team

Director of OIE Collaborating Centre for Viral Genomics and Bioinformatics

Director of Centre for Virus Research