Eric Rivals. Hybrid and non hybrid error correction for long reads: LoRDEC and LoRMA. Colib’read workshop, ANR Colib’read, Nov 2016, Paris, France.
HAL Id: lirmm-01446434, https://hal-lirmm.ccsd.cnrs.fr/lirmm-01446434 (submitted on 25 Jan 2017)


Hybrid and non hybrid error correction for long reads: LoRDEC and LoRMA

Eric Rivals

Computer Science Lab & Institute Computational Biology, CNRS & Univ. Montpellier

7th Nov. 2016

Rivals (CNRS Univ. Montpellier) Long read correction 7th Nov. 2016 1 / 59


Outline

1 Introduction

2 LoRDEC algorithm

3 LoRDEC experimental results
    Impact of parameters
    Scalability
    Correction of transcriptomic reads (RNA-seq)
    Correction of Oxford Nanopore MinION reads

4 LoRDEC∗+LoRMA

5 LoRMA experimental results

6 Conclusion and future works


Introduction

Revolution in DNA sequencing


Third generation technologies

©Oxford Nanopore

PacBio (Pacific Biosciences): up to 25 Kbp

Oxford Nanopore MinION: up to 50 Kbp

Moleculo synthetic reads: up to 10 Kbp


Overview of sequencing techniques

Name                  Read length (bp)  Run time  Gb/run  Pros / cons
454 GS Flex           700               1 d       0.7     long / indels
Illumina HiSeq X      2*300             3 d       200     short / cost
Illumina NextSeq 500  2*300             3 d       150     PE, single / idem
SOLiD (LifeSc)        85                8 d       150     long time
Ion Proton            200               2 h       100     new
Illumina TruSeq       10-8500           -         4       synthetic reads
PacBio Sciences       10-40000          0.3 d     3       high error rate
Oxford MinION         10-50000          1 d       0.8     high error rate

The vast majority of errors for PacBio and Oxford are insertions & deletions.


Context

3rd generation sequencing technologies yield longer reads

PacBio Single Molecule Real Time sequencing: much longer reads (up to 25 Kb) but much higher error rates

Error correction is required
1 self correction: using long reads only
2 hybrid correction: using short reads to correct long reads


Hybrid correction methods

[Koren et al, Nat. Bio. 2012]

Short reads are aligned to long reads

a consensus is applied to correct part of the long read


Self correction methods

[Chin et al, Nat. Met. 2013]

Long reads are corrected with shorter reads from the same technology


Other hybrid PacBio error correction programs

PacBioToCA [Koren et al. 2012]

AHA [Bashir et al. 2012]: inside the assembler

LSC [Au et al. 2012]: compress homopolymers before alignment

All follow an alignment based strategy (e.g. BLAST like)

proovread [Hackl et al. 2014]: alignment & chimera detection

Jabba [Miclotte et al. 2015]: LoRDEC’s approach + MEM based alignment; variable length seeds for anchoring the LR on the graph

CoLoRMap [Haghshenas et al. 2016]: alignment & local assembly


Hybrid correction and assembly

ECtools [Lee et al. bioRxiv 2014]
    assemble SR into unitigs, assemble unitigs and LR with Celera

Nanocorr [Goodwin et al. bioRxiv 2014]
    recruit SR for a LR using BLAST
    select SR with Longest Increasing Subsequence (LIS)
    compute consensus
    assembly with Celera

NaS (Nanopore) [Madoui et al. BMC Genomics 2015]
    recruit SR for each LR and reassemble the LR sequence
    complex pipeline

All need to assemble SR


Motivation

LR correction programs "require high computational resources and long running times on a supercomputer even for bacterial genome datasets".

[Deshpande et al. 2013]

For a 1 Gb plant genome, correction of 18x PacBio with 160x Illumina required 600000 CPU hours with ECtools!


Contributions

LoRDEC

a new and efficient hybrid correction algorithm

based on De Bruijn Graphs (DBG) of short reads

avoids the time consuming alignments (of SR on LR)

LoRMA

a complementary tool to LoRDEC for self correction of long reads

a pipeline that iterates LoRDEC and then applies LoRMA


Overview of raw and corrected PacBio reads


LoRDEC algorithm


Algorithm overview

1 build a de Bruijn graph of the short reads

the graph represents the short reads in compact form

2 take each long read in turn and attempt to correct it

1 correct internal regions,

2 correct end regions of the long read


Example of short read DBG of order 3

[Figure: de Bruijn graph of order 3 with nodes gga, gac, acg, cga, gag, aac, gaa, agc, caa, gca]

S = {ggacgaa, cgaac, gacgag, cgagcaa, gcaacg}

The DBG is built from the set of short reads (Illumina) using the GATB library.
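The construction can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: it does not use the GATB API, the names `count_kmers` and `build_dbg` are hypothetical, the graph is taken to be node-centric (an edge joins any two k-mers of the graph that overlap by k-1 characters), and reverse complements are ignored.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence over all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def build_dbg(kmers):
    """Node-centric de Bruijn graph: u -> v when v extends u by one base."""
    nodes = set(kmers)
    return {u: sorted(u[1:] + c for c in "acgt" if u[1:] + c in nodes)
            for u in sorted(nodes)}

# Read set S of the order-3 example.
S = ["ggacgaa", "cgaac", "gacgag", "cgagcaa", "gcaacg"]
counts = count_kmers(S, 3)
graph = build_dbg(counts)
```

On S this yields the ten 3-mers of the example as nodes; a production tool stores the graph far more compactly than a Python dictionary.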


Filtering k-mers of short reads

Filtering k-mer rationale

Because errors are randomly positioned, erroneous k-mers have low expected occurrence numbers

Threshold based filter s: minimum number of occurrences in short reads

All k-mers present more than s times are called solid k-mers and kept in the de Bruijn graph
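A minimal sketch of the filter, assuming plain Python and a hypothetical helper `solid_kmers`. It follows the wording above literally (a k-mer is solid when it occurs strictly more than s times), applied to the read set S of the order-3 example.

```python
from collections import Counter

def solid_kmers(reads, k, s):
    """Return the k-mers occurring strictly more than s times in the reads."""
    counts = Counter(read[i:i + k]
                     for read in reads
                     for i in range(len(read) - k + 1))
    return {kmer for kmer, n in counts.items() if n > s}

# Read set S of the order-3 example, filtered with s = 1.
S = ["ggacgaa", "cgaac", "gacgag", "cgagcaa", "gcaacg"]
solid = solid_kmers(S, 3, 1)
```

With s = 1 the two k-mers seen only once (gga and agc) are discarded and eight solid k-mers remain.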


Example of filtered short read DBG of order 3

Raw graph: nodes gga, gac, acg, cga, gag, aac, gaa, agc, caa, gca

Graph filtered with s = 1: nodes gac, acg, cga, gag, aac, gaa, caa, gca

S = {ggacgaa, cgaac, gacgag, cgagcaa, gcaacg}


Long read sequence is partitioned

[Figure: a long read split into a head, inner regions and a tail; ◦ marks the solid k-mers of the long read, which act as sources and targets]

Solid k-mers are a priori correct pieces of the sequence

we correct the region between two solid k-mers
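The partition can be illustrated as follows. This is a sketch under assumptions, not LoRDEC's code: `partition_read` is a hypothetical name, positions are 0-based start offsets of k-mers in the read, and an inner region is reported whenever two consecutive solid k-mers are separated by at least one weak k-mer.

```python
def partition_read(read, solid, k):
    """Split a long read into head, inner regions and tail around solid k-mers.

    Returns (head, inner, tail), where inner is a list of (source, target)
    start positions of consecutive solid k-mers separated by weak k-mers.
    """
    positions = [i for i in range(len(read) - k + 1) if read[i:i + k] in solid]
    if not positions:
        # No solid k-mer: the whole read is treated as an uncorrectable head.
        return read, [], ""
    head = read[:positions[0]]
    tail = read[positions[-1] + k:]
    inner = [(a, b) for a, b in zip(positions, positions[1:]) if b - a > 1]
    return head, inner, tail

# Solid 3-mers of the filtered example graph and a made-up long read.
solid = {"gac", "acg", "cga", "gag", "aac", "gaa", "caa", "gca"}
head, inner, tail = partition_read("tgacgtgcaat", solid, 3)
```

Here the solid k-mers start at positions 1, 2, 6 and 7, so the only inner region to correct lies between the k-mers at positions 2 and 6.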


Long read is corrected with DBG

[Figure: a bridge path found between s1 and t1; no path found between s2 and t2; an extension path starting from s3]

For each putative region of a long read:

align the region to paths of the de Bruijn graph

find best path according to edit distance

limited path search
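A possible reading of these steps, sketched in plain Python. Everything here is an assumption made for illustration: the function names, the limits `max_len` and `max_paths`, and the choice to compare the whole path string (source and target k-mers included) with the corresponding read region using Levenshtein distance.

```python
def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def bridge(solid, source, target, region, max_len, max_paths=1000):
    """Limited depth-first search of paths source -> target among solid k-mers.

    Returns (distance, path_string) for the path closest to `region`,
    or None when no path is found within the limits.
    """
    k = len(source)
    best, explored = None, 0
    stack = [source]  # each entry is the string spelled by a partial path
    while stack and explored < max_paths:
        path = stack.pop()
        if len(path) > k and path.endswith(target):
            explored += 1
            d = edit_distance(path, region)
            if best is None or d < best[0]:
                best = (d, path)
            continue
        if len(path) >= max_len:
            continue  # path length limit reached, abandon this branch
        for c in "acgt":
            if path[-(k - 1):] + c in solid:
                stack.append(path + c)
    return best

solid = {"gac", "acg", "cga", "gag", "aac", "gaa", "caa", "gca"}
found = bridge(solid, "gac", "gaa", "gactgaa", max_len=8)
missing = bridge(solid, "gag", "gac", "gagtgac", max_len=8)
```

In the first call the read region gactgaa carries a spurious t, and the search proposes the graph path gacgaa at distance 1; the second call illustrates the "path not found" case.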


LoRDEC: Correcting read ends

Find a path in the DBG starting from the extreme solid k-mer

Maximize the length of the prefix of the end to correct

Minimize the edit distance between the path and the prefix of the end

Find the best extension maximizing an alignment score

[Figure: a bridge path found between s1 and t1; no path found between s2 and t2; an extension path starting from s3]


Correction algorithm

1 Correct inner region:
    1 depth first search traversal of paths between source and target k-mers
    2 node wise: minimal edit distance computation with the sequence region

2 Correct end region

3 Paths optimisation:
    1 build a graph of all correction paths for the current read
    2 find a shortest path between the first and last solid k-mers (Dijkstra algorithm)
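The path optimisation step can be sketched with a textbook Dijkstra search. The data model is an assumption made for this example: nodes are integer indices of the read's solid k-mers, and each edge carries a cost (for instance the edit distance of a correction path) together with the replacement string it would contribute.

```python
import heapq

def shortest_correction(edges, first, last):
    """Dijkstra over a graph whose nodes are solid k-mers of one read.

    `edges` maps a node to a list of (next_node, cost, replacement) triples.
    Returns (total_cost, [replacement, ...]) or None if `last` is unreachable.
    """
    dist = {first: 0}
    back = {}
    heap = [(0, first)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == last:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, cost, replacement in edges.get(u, []):
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                back[v] = (u, replacement)
                heapq.heappush(heap, (nd, v))
    if last not in dist:
        return None
    pieces, node = [], last
    while node != first:
        node, replacement = back[node]
        pieces.append(replacement)
    return dist[last], pieces[::-1]

# Hypothetical correction paths between three solid k-mers of one read.
edges = {0: [(1, 1, "acg"), (2, 5, "acgtt")], 1: [(2, 2, "tt")]}
best = shortest_correction(edges, 0, 2)
```

In this toy graph, chaining the two short corrections (total cost 3) beats the single long one (cost 5).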


Trimming and splitting (optional)

Classify each base as solid if it belongs to at least one solid k-mer and weak otherwise

LoRDEC outputs solid bases in upper case characters and weak ones in lower case characters

Corrected reads can be trimmed and/or split:
1 Trim weak bases from both ends of the read
2 Extract all runs of solid bases from the corrected reads

Output of LoRDEC:
>read1
acgtgaGTAGTCGAGTagcgtagGTGGATCGAGCTAGggggt

Trimmed read:
>read1
GTAGTCGAGTagcgtagGTGGATCGAGCTAG

Trimmed and split reads:
>read1_1
GTAGTCGAGT
>read1_2
GTGGATCGAGCTAG
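Both operations follow directly from the upper/lower case convention and can be mimicked with regular expressions. This is a small illustration, not the code of the LoRDEC tools; `trim` and `split_solid` are hypothetical names.

```python
import re

def trim(read):
    """Remove weak (lower-case) bases from both ends of a corrected read."""
    return re.sub(r"^[a-z]+|[a-z]+$", "", read)

def split_solid(read):
    """Return every maximal run of solid (upper-case) bases."""
    return re.findall(r"[A-Z]+", read)

# The corrected read shown in the example output.
corrected = "acgtgaGTAGTCGAGTagcgtagGTGGATCGAGCTAGggggt"
trimmed = trim(corrected)
pieces = split_solid(corrected)
```

Applied to the example read, `trimmed` keeps the internal weak stretch while `pieces` holds the two solid runs.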


LoRDEC correction pipeline

Filtering short-reads data for quality value and adapter presence: cutadapt [Martin, 2012]

Long reads correction with LoRDEC. Two parameters must be set:
    k-mer length: default k = 19
    threshold: minimum abundance for a k-mer to be solid, that is, to be included in the de Bruijn graph


LoRDEC experimental results


Data sets

                   E. coli   Yeast    Parrot
Genome size        4.6 Mbp   12 Mbp   1.23 Gbp
PacBio coverage    21x       129x     5.5x
Illumina coverage  50x       38x      28x


Results: time and memory

Data     Method      CPU time       Elapsed time   Memory (GB)  Disk (GB)
E. coli  PacBioToCA  45 h 18 min    3 h 12 min     9.91         13.59
E. coli  LSC         39 h 48 min    2 h 56 min     8.21         8.51
E. coli  LoRDEC      2 h 16 min     10 min         0.96         0.41
Yeast    PacBioToCA  792 h 41 min   21 h 57 min    13.88        214
Yeast    LSC         1200 h 46 min  130 h 16 min   24.04        517
Yeast    LoRDEC      56 h 08 min    3 h 37 min     0.97         1.63
Parrot   LoRDEC      568 h 48 min   29 h 7 min     4.61         74.85


Runtime, memory and disk usage

[Figure: bar chart for Yeast comparing PacBioToCA, LSC and LoRDEC on CPU time (h), memory (GB) and disk (GB)]


Evaluation methods

Two ways:

1 how do the reads align to the genome?

2 how do raw and corrected reads differ in their alignments?

Using the Error Correction Toolkit [Yang et al. 2013] we compute

Sensitivity = TP/(TP+FN): how well does the tool recognise erroneous positions?

Gain = (TP-FP)/(TP+FN): how well does the tool remove errors without introducing new ones?
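The two formulas translate directly into code. The counts used below are hypothetical and only serve to show how gain penalises newly introduced errors (false positives) while sensitivity does not; this is not the Error Correction Toolkit itself.

```python
def sensitivity(tp, fn):
    """Fraction of erroneous positions that were corrected: TP / (TP + FN)."""
    return tp / (tp + fn)

def gain(tp, fp, fn):
    """Net fraction of errors removed: (TP - FP) / (TP + FN)."""
    return (tp - fp) / (tp + fn)

# Hypothetical counts: 90 errors fixed, 10 missed, 5 new errors introduced.
tp, fp, fn = 90, 5, 10
sens = sensitivity(tp, fn)
g = gain(tp, fp, fn)
```

Gain can never exceed sensitivity and becomes negative when a tool introduces more errors than it removes.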


Error correction performance: E. coli

[Figure: bar chart of size, aligned fraction, identity and genome coverage (scale 0.6 to 1) for uncorrected reads and for reads corrected by PacBioToCA, LSC and LoRDEC]


Error correction performance: Parrot

[Figure: bar chart of size, aligned fraction, identity and genome coverage (scale 0.4 to 1) for uncorrected reads and for reads corrected by LoRDEC]


Sensitivity and gain results

Data     Method      Sensitivity  Gain
E. coli  PacBioToCA  NA           NA
E. coli  LSC         0.2865       0.2232
E. coli  LoRDEC      0.9090       0.8997
Yeast    PacBioToCA  NA           NA
Yeast    LSC         0.3246       0.2596
Yeast    LoRDEC      0.8427       0.8194
Parrot   LoRDEC      0.8962       0.8544

Impact of parameters

Parameters: E. coli

[Figure: gain and runtime (s) of LoRDEC on E. coli as a function of k, for k from 14 to 30]


Scalability of LoRDEC

[Figure: CPU time (h), memory (GB) and disk (GB) used by LoRDEC on E. coli, Yeast and Parrot, plotted on a logarithmic scale from 0.1 to 1000]


Maize transcriptome data

Illumina HiSeq: 194 million reads, 29 Gbp

PacBio: 276000 reads, 168 Mbp

LoRDEC time: 12 hours

LoRDEC memory: 5 Gbytes

Correction of transcriptomic reads (RNA-seq)

Chicken transcriptome with PacBio

PacBio data                 Raw     Corrected and trimmed
# reads (x1000)             1 849   1 848
# reads > 1 Kbp (x1000)     687     569
Max length of reads (Kbp)   12.2    11.9
Total length (Gbp)          1.98    1.77
%GC                         48.08   47.28
Avg length (bp)             1 075   960


After correction and mapping with BWA-MEM [Li H., 2013] on the reference transcriptome (1 RNA per gene):

5% more transcripts covered with uniquely mapping reads

80% identity in alignments vs 66% before correction


Overview of raw and corrected PacBio RNA reads

Correction of Oxford Nanopore MinION reads

Correcting E. coli Nanopore MinION data

[Figures: QUAST reports for the raw reads and for the corrected reads]

Nanopore data        Raw     Corrected
Nb reads             3463    2749
Nb reads ≥ 1 Kbp     3420    2685
Total length (Mbp)   22      17
Unaligned bases (%)  99.99   7.60
Genome fraction (%)  0.02    96.59

Quast [Gurevich et al. 2013]


MinION S. aureus data

Mapping of reads with BWA-MEM onto the reference genome with appropriate options

ref genome: 2.8 Mbp

MinION sequencing coverage 14x

gain for k = 17 and s = 2 reaches 69%

99.9% of the genome covered by corrected reads

65% of the genome at median coverage 8x

79% identity instead of 66% without correction


LoRDEC∗+LoRMA


Overview of LoRDEC∗+LoRMA

Modify LoRDEC to run on long reads only ⇒ LoRDEC∗

Run LoRDEC∗ iteratively with increasing k

Polish the result with multiple alignments ⇒ LoRMA

PacBio reads → LoRDEC∗ (repeated with increasing k) → LoRMA → Corrected reads
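The control flow of the pipeline can be written as a short driver. This is only a sketch of the iteration order: `correct_with_k` and `polish` are hypothetical callables standing in for one LoRDEC∗ pass and for LoRMA, and the stand-ins used in the example do no real correction.

```python
def self_correct(reads, ks, correct_with_k, polish):
    """Run one correction pass per k (in increasing order), then polish."""
    for k in sorted(ks):
        reads = correct_with_k(reads, k)
    return polish(reads)

# Stand-in steps that only record what was called, to show the order.
calls = []

def fake_pass(reads, k):
    calls.append(k)
    return reads

def fake_polish(reads):
    return reads + ["polished"]

result = self_correct(["read1"], [19, 40, 61], fake_pass, fake_polish)
```

Each pass receives the output of the previous one, so the reads seen at a larger k are already partly corrected, and the polishing step runs once at the end.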


LoRDEC

Build a de Bruijn graph of the short reads
    Use a small k such that the genomic k-mers are expected to be found in the reads
    Use an abundance threshold to differentiate between correct and erroneous k-mers

For each long read:
    Classify k-mers: solid (= in the DBG) and weak
    Find paths in the DBG between the solid k-mers
    Minimize edit distance between the long read and the path’s string
    Select a correcting path only if all possibilities have been explored.

[Figure: de Bruijn graph of order 4 with nodes ACGT, CGTT, GTTC, TTCA, TCAA, CAAC, AACC, ACCC, CCCT, CGTA, GTAA, TAAC, AGTT, TTCC, TAAG, shown with the long read fragments ACGT, C, CAACT, CCCT]


LoRDEC∗

Build a de Bruijn graph of the LONG reads
    Use a small k such that the genomic k-mers are expected to be found in the reads
    Use an abundance threshold to differentiate between correct and erroneous k-mers

For each long read:
    Classify k-mers: solid (= in the DBG) and weak
    Find paths in the DBG between the solid k-mers
    Minimize edit distance between the long read and the path’s string
    Select a correcting path only if all possibilities have been explored.


LoRMA

Build a de Bruijn graph of the reads

Annotate the graph by threading each read through the graph

For each read find its friends, i.e. the most similar reads

Use a multiple alignment of a read and its friends to correct the read
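
As a rough illustration of the annotation and friend-finding steps, the sketch below records which reads contain each k-mer and ranks the other reads by the number of k-mers they share with a given read. Counting shared k-mers is an assumption used here as a simple similarity proxy; LoRMA's own similarity measure and its multiple-alignment step are not reproduced, and all names and toy reads are hypothetical.

```python
from collections import Counter, defaultdict

def kmer_set(seq, k):
    """Distinct k-mers of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def annotate(reads, k):
    """Map each k-mer (graph node) to the ids of the reads passing through it."""
    index = defaultdict(set)
    for rid, seq in enumerate(reads):
        for km in kmer_set(seq, k):
            index[km].add(rid)
    return index

def friends(rid, reads, index, k, n):
    """Ids of up to n reads sharing the most k-mers with read `rid`."""
    shared = Counter()
    for km in kmer_set(reads[rid], k):
        for other in index[km]:
            if other != rid:
                shared[other] += 1
    return [other for other, _ in shared.most_common(n)]

reads = ["ACGTTCAACC", "CGTTCAACCC", "ACGTAACCCT", "GGGGGGGGGG"]
index = annotate(reads, k=4)
print(friends(0, reads, index, k=4, n=2))
```

Read 1 shares six 4-mers with read 0 and read 2 shares two, so they are returned in that order; the unrelated read 3 has no friends at all.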

[Figure: example de Bruijn graph with k = 4. Nodes: ACGT, CGTT, GTTC, TTCA, TCAA, CAAC, AACC, ACCC, CCCT, CGTA, GTAA, TAAC, AGTT, TTCC, TAAG.]

LoRMA experimental results

Outline

1 Introduction

2 LoRDEC algorithm

3 LoRDEC experimental results
    Impact of parameters
    Scalability
    Correction of transcriptomic reads (RNA-seq)
    Correction of Oxford Nanopore MinION reads

4 LoRDEC∗+LoRMA

5 LoRMA experimental results

6 Conclusion and future works

Evaluation method

Process

1 Align the raw and corrected reads to the genome with BLASR [Chaisson and Tesler, 2012]

2 Consider a single best alignment for each read.

Compute the following metrics:

total size of corrected reads

total aligned size of corrected reads

error rate of aligned regions (number of erroneous positions / aligned length)

genome coverage

genome coverage
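
The four metrics can be computed from one best alignment per read, as in the sketch below. The `BestAlignment` record is a hypothetical, simplified stand-in and not BLASR's output format; genome coverage is taken here to be the fraction of genome positions covered by at least one alignment, which is an assumption about the slide's definition.

```python
from dataclasses import dataclass

@dataclass
class BestAlignment:
    read_length: int      # length of the (corrected) read
    ref_start: int        # 0-based, inclusive
    ref_end: int          # exclusive
    aligned_length: int   # length of the aligned region
    errors: int           # erroneous positions in the aligned region

def covered_bases(intervals):
    """Number of reference positions covered by the union of the intervals."""
    total, cur_start, cur_end = 0, None, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def metrics(alignments, unaligned_read_lengths, genome_length):
    """Total size, aligned size, error rate and genome coverage."""
    size = sum(a.read_length for a in alignments) + sum(unaligned_read_lengths)
    aligned = sum(a.aligned_length for a in alignments)
    errors = sum(a.errors for a in alignments)
    cov = covered_bases((a.ref_start, a.ref_end) for a in alignments)
    return {
        "size": size,
        "aligned": aligned,
        "error_rate": errors / aligned if aligned else 0.0,
        "genome_coverage": cov / genome_length,
    }

alns = [BestAlignment(100, 0, 90, 90, 3), BestAlignment(80, 50, 120, 70, 1)]
result = metrics(alns, unaligned_read_lengths=[20], genome_length=200)
print(result)
```

In this toy case the two alignments overlap on the reference, so they cover 120 of 200 positions (coverage 0.6), and 4 erroneous positions over 160 aligned bases give an error rate of 0.025.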

Selfcorrection: E. coli with k = 19, 40, 61

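[Figure: bar charts comparing Original, PBcR (self) and LoRDEC*+LoRMA. First panel: Size, Aligned and Genome Coverage (%, axis 0 to 100). Second panel: Error Rate (axis 0 to 3). An additional label on the plot reads 17.]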

Selfcorrection and hybrid correction: E. coli

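[Figure: bar charts comparing Original, PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid) and Jabba. First panel: Size, Aligned and Genome Coverage (%, axis 0 to 100). Second panel: Error Rate (axis 0 to 3). An additional label on the plot reads 17.]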

Selfcorrection: Yeast

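[Figure: bar charts comparing Original, PBcR (self) and LoRDEC*+LoRMA. First panel: Size, Aligned and Genome Coverage (%, axis 0 to 100). Second panel: Error Rate (axis 0 to 3). An additional label on the plot reads 17.]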

Selfcorrection and hybrid correction: Yeast

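[Figure: bar charts comparing Original, PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid) and Jabba. First panel: Size, Aligned and Genome Coverage (%, axis 0 to 100). Second panel: Error Rate (axis 0 to 3). An additional label on the plot reads 17.]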

Selfcorrection: Resources

[Figure: bar charts of Runtime (h), Memory (GB) and Disk (GB), axis 0 to 40, with one panel for E. coli and one for Yeast, comparing PBcR (self) and LoRDEC*+LoRMA.]

Selfcorrection and hybrid correction: Resources

[Figure: bar charts of Runtime (h), Memory (GB) and Disk (GB), axis 0 to 40, with one panel for E. coli and one for Yeast, comparing PBcR (self), LoRDEC*+LoRMA, LoRDEC, proovread, PBcR (hybrid) and Jabba. Additional labels on the plots read 160 and 158.]

Conclusion and future works

Take home message

LoRDEC

is at least 6 times faster than previous methods

uses at least 93% less memory than previous methods

corrects both PacBio & Nanopore reads

scales up to vertebrate cases

achieves accuracy similar to state-of-the-art methods.

LoRDEC is freely available at http://atgc.lirmm.fr/lordec/

LoRDEC and LoRMA use GATB

http://gatb.inria.fr

Conclusions

LoRDEC∗+LoRMA [Bioinformatics 2016]:

DBG-based initial correction of sequencing errors in long read data

Further polishing with multiple alignments

Accurate selfcorrection method, needs high coverage (75×)

Future: improve memory footprint and running time

Freely available at http://www.cs.helsinki.fi/u/lmsalmel/LoRMA/

LoRDEC and LoRMA publications

L. Salmela, E. Rivals. LoRDEC: accurate and efficient long read error correction. Bioinformatics, 30(24):3506-3514, 2014. doi:10.1093/bioinformatics/btu538

L. Salmela, R. Walve, E. Rivals, E. Ukkonen. Accurate selfcorrection of errors in long reads using de Bruijn graphs. Bioinformatics, 2016. doi:10.1093/bioinformatics/btw321

Funding and acknowledgements

Thank you for your attention!

Questions?

Thanks to L. Salmela, R. Wake, E. Ukkonen, A. Makrini
