Scaling metagenome assembly

C. TITUS BROWN ET AL. COMPUTER SCIENCE / MICROBIOLOGY DEPTS MICHIGAN STATE UNIVERSITY IN COLLABORATION WITH GREAT PRAIRIE GRAND CHALLENGE (TIEDJE, JANSSON, TRINGE) Scaling metagenome assembly – to infinity and beeeeeeeeeeyond!

Upload: ctitusbrown

Post on 10-May-2015

DESCRIPTION

Talk given at JGI metagenome assembly workshop, Oct 12, 2011.

TRANSCRIPT

Page 1: Scaling metagenome assembly

C. TITUS BROWN ET AL.

COMPUTER SCIENCE / MICROBIOLOGY DEPTS

MICHIGAN STATE UNIVERSITY

IN COLLABORATION WITH GREAT PRAIRIE GRAND CHALLENGE

(TIEDJE, JANSSON, TRINGE)

Scaling metagenome assembly – to infinity and beeeeeeeeeeyond!

Page 2: Scaling metagenome assembly

SAMPLING LOCATIONS

Page 3: Scaling metagenome assembly

Sampling strategy per site

Reference soil

Soil cores: 1 inch diameter, 4 inches deep

Total:

8 Reference metagenomes +

64 spatially separated cores (pyrotag sequencing)

[Figure: sampling layout; scale labels 10 m, 1 m, 1 cm]

Page 4: Scaling metagenome assembly

Great Prairie sequencing summary

[Figure: bar chart of basepairs of sequencing (Gbp) for GAII and HiSeq; axis from 0 to 350]

200x human genome…!

> 10x more challenging (total diversity)

Page 5: Scaling metagenome assembly

Our perspective

Great Prairie project: there is no end to the data!

Immense biological depth: estimate ~1-2 TB (10**12) of raw sequence needed to assemble top ~20-40% of microbes.

Improvements in sequencing tech

Existing methods for scaling assembly simply will not suffice: this is a losing battle.

Abundance filtering XXX

Better data structures XXX

Parallelization is not going to be sufficient; neither are advances in data structures.

I think: bad scaling is holding back assembly progress.

Page 6: Scaling metagenome assembly

Our perspective, #2

Deep sampling is needed for these samples.

Illumina is it, for now.

The last thing in the world we want to do is write yet another assembler… pre-assembly filtering, instead.

All of our techniques can be used together with any assembler.

We’ve mostly stuck with Velvet, for reasons of historical contingency.

Page 7: Scaling metagenome assembly

Two enabling technologies

Very efficient k-mer counting

Bloom counting hash / Count-Min Sketch data structure; constant memory.

Scales ~10x over traditional data structures.

k-independent.

Probabilistic properties well suited to next-gen data sets.

Very efficient de Bruijn graph representation

We traverse k-mers stored in constant-memory Bloom filters.

Compressible probabilistic data structure; very accurate.

Scales ~20x over traditional data structures.

k-independent.

…cannot directly be used for assembly because of false positives (FP).
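
To make the constant-memory counting idea concrete, here is a minimal Count-Min Sketch in Python. It is an illustrative sketch, not the khmer implementation: the class name, the table sizes, and the use of salted `hashlib.blake2b` as the hash family are all assumptions. Counts can be overestimated when k-mers collide, but never underestimated.

```python
import hashlib


class CountMinSketch:
    """Approximate k-mer counter with memory fixed at construction time."""

    def __init__(self, width=100_003, depth=4):
        # `depth` independent tables of `width` counters each (sizes are illustrative).
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _slots(self, kmer):
        # One slot per table; a different salt per table gives a different hash.
        for i in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield i, int.from_bytes(digest, "big") % self.width

    def add(self, kmer):
        for i, j in self._slots(kmer):
            self.tables[i][j] += 1

    def count(self, kmer):
        # The minimum over tables is the least-inflated estimate.
        return min(self.tables[i][j] for i, j in self._slots(kmer))


def kmers(seq, k):
    """Yield every overlapping k-mer of `seq`."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


sketch = CountMinSketch()
for km in kmers("ACGTACGTAC", 4):
    sketch.add(km)
print(sketch.count("ACGT"))  # at least the true count of 2
```

Memory depends only on `width * depth`, not on k or on the number of distinct k-mers, which is the property the slide calls k-independent.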

Page 8: Scaling metagenome assembly

Approach 1: Partitioning

Use compressible graph representation to explore natural structure of data: many disconnected components.
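
A rough sketch of how a Bloom filter can stand in for a de Bruijn graph: only k-mer membership is stored, and edges are implied by asking whether each of the eight possible neighbouring k-mers is present. The class, the sizes, and the hash choice below are assumptions for illustration, not the khmer code. False positives can add spurious edges, so a component found this way is a superset of the true one.

```python
import hashlib
from collections import deque


class BloomFilter:
    """Set membership in fixed memory; may report false positives."""

    def __init__(self, size=1_000_003, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)  # one byte per slot, for simplicity

    def _slots(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.blake2b(
                item.encode(), digest_size=8, salt=bytes([i])
            ).digest()
            yield int.from_bytes(digest, "big") % self.size

    def add(self, item):
        for j in self._slots(item):
            self.bits[j] = 1

    def __contains__(self, item):
        return all(self.bits[j] for j in self._slots(item))


def neighbors(kmer, bloom):
    """Yield the k-mers adjacent to `kmer` that the filter reports as present."""
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand != kmer and cand in bloom:
                yield cand


def component(start, bloom):
    """Breadth-first walk collecting every k-mer reachable from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        for nxt in neighbors(queue.popleft(), bloom):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

Loading the k-mers of two unrelated sequences and calling `component` on a k-mer from each should return two separate sets, which is the "many disconnected components" structure described above.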

Page 9: Scaling metagenome assembly

Partitioning for scaling

Can be done in ~10x less memory than assembly.

Partition at low k and assemble exactly at any higher k (DBG).

Partitions can then be assembled independently:

Multiple processors -> scaling

Multiple k, coverage -> improved assembly

Multiple assembly packages (tailored to high variation, etc.)

Can eliminate small partitions/contigs in the partitioning phase.

In theory, an exact approach to divide and conquer/data reduction.
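
As a toy illustration of the divide-and-conquer step, the sketch below groups reads into partitions with a union-find structure: any two reads that share a k-mer end up in the same partition, and each partition could then be handed to an assembler on its own. This is a simplification assumed for illustration. It uses an exact in-memory dictionary rather than the compressible probabilistic graph, and the function name `partition_reads` is hypothetical.

```python
def partition_reads(reads, k):
    """Group reads so that reads sharing any k-mer land in the same partition."""
    parent = list(range(len(reads)))

    def find(x):
        # Follow parent links to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_owner = {}  # k-mer -> index of the first read that contained it
    for idx, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            owner = first_owner.setdefault(read[i:i + k], idx)
            if owner != idx:
                parent[find(idx)] = find(owner)  # merge the two partitions

    groups = {}
    for idx, read in enumerate(reads):
        groups.setdefault(find(idx), []).append(read)
    return list(groups.values())


parts = partition_reads(["ACGTACGT", "GTACGTTT", "CCCCGGGG"], k=5)
print(len(parts))  # the first two reads share GTACG; the third stands alone
```

Because partitions share no k-mers, assembling them separately at that k loses no graph connectivity, which is why the slide calls the approach exact in theory.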

Page 10: Scaling metagenome assembly

Average Coverage of K-mers in Partitions

[Figure: k-mer coverage (y-axis, 0 to 10) versus rank abundance partition number (x-axis, 0 to 1000)]

Adina Howe

Page 11: Scaling metagenome assembly

Partitioning challenges

Technical challenge: existence of “knots” in the graph that artificially connect everything.

Unfortunately, partitioning is not the solution.

Runs afoul of same k-mer/error scaling problem that all k-mer assemblers have…

20x scaling isn’t nearly enough, anyway

Page 12: Scaling metagenome assembly

Digression: sequencing artifacts

Adina Howe

Page 14: Scaling metagenome assembly

Approach 2: Digital normalization

“Squash” high coverage reads

Eliminate reads we’ve seen before (e.g. “> 5 times”)

Digital version of experimental “mRNA normalization”.

Nice algorithm!

Single-pass

Constant memory

Trivial to implement

Easy to parallelize / scale (memory AND throughput)

“Perfect” solution?

(Works fine for MDA, mRNAseq…)

Page 15: Scaling metagenome assembly

Digital normalization

Two benefits:

1) Decrease amount of data (real, but redundant sequence)

2) Eliminate errors associated with that redundant sequence.

Single-pass algorithm (cf. streaming sketch algorithms)

Page 16: Scaling metagenome assembly

Digital normalization validation?

Two independent methods for comparing assemblies… by both of them, we get very similar results for raw and treated.

Page 17: Scaling metagenome assembly

Comparing assemblies quantitatively

Build a “vector basis” for assemblies out of orthogonal M-base windows of DNA.

This allows us to disassemble assemblies into vectors, compare them, and even “subtract” them from one another.
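
One possible reading of this idea, offered only as an assumption since the slide does not spell out the method: treat each distinct M-base window as one axis, so an assembly becomes a vector of window counts. Vectors can then be compared with cosine similarity and "subtracted" to see what one assembly has that another lacks. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def to_vector(contigs, m):
    """Count every M-base window across the contigs of one assembly."""
    vec = Counter()
    for contig in contigs:
        for i in range(len(contig) - m + 1):
            vec[contig[i:i + m]] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two window-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def subtract(a, b):
    """Windows present in `a` beyond what `b` contains (negative counts dropped)."""
    return a - b


raw = to_vector(["ACGTAC"], 3)
treated = to_vector(["ACGT"], 3)
print(cosine(raw, treated))
print(subtract(raw, treated))  # windows only the raw assembly contains
```

A similarity near 1 between the raw and treated assemblies would be consistent with "very similar results", while the subtraction shows which sequence a treatment dropped.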

Page 18: Scaling metagenome assembly

Running HMMs over de Bruijn graphs (=> cross validation)

hmmgs: Assemble based on good-scoring HMM paths through the graph.

Independent of other assemblers; very sensitive, specific.

95% of hmmgs rplB domains are present in our partitioned assemblies.

[Figure: de Bruijn graph fragment with sequence-labeled nodes]

Jordan Fish, Qiong Wang, and Jim Cole (RDP)

Page 19: Scaling metagenome assembly

Digital normalization validation

Two independent methods for comparing assemblies… by both of them, we get very similar results for raw and treated.

Hmmgs results tell us that Velvet multi-k assembly is also very sensitive.

Our primary concern at this point is about long-range artifacts (chimeric assembly).

Page 20: Scaling metagenome assembly

Techniques

Developed suite of techniques that work for scaling, without loss of information (?)

While we have no good way to assess chimeras and misassemblies, basic sequence content and gene content stay the same across treatments.

And… what, are we just sitting here writing code?

No! We have data to assemble!

Page 21: Scaling metagenome assembly

Assembling Great Prairie data, v0.8

Iowa corn GAII, ~500m reads / 50 Gb => largest partition ~200k reads

84 Mb in 53,501 contigs > 1kb.

Iowa prairie GAII, ~500m reads / 50 Gb => biggest ~100k read partition

102 Mb in 70,895 contigs > 1kb.

Both done on a single 8-core Amazon EC2 bigmem node, 68 GB of RAM, ~$100.

(Yay, we can do it! Boo, we’re only using 2% of reads.)

No systematic optimization of partitions yet; 2-4x improvement expected. Normalization of HiSeq is also yet to be done.

Have applied to other metagenomes, note; longer story.

Page 22: Scaling metagenome assembly

Future directions?

khmer software reasonably stable & well-tested; needs documentation, software engineering love.

github.com/ctb/khmer/ (see ‘refactor’ branch…)

Massively scalable implementation (HPC & cloud).

Scalable digital normalization (~10 TB / 1 day? ;)

Iterative partitioning

Integrating other types of sequencing data (454, PacBio, …)?

Polymorphism rates / error rates seem to be quite a bit higher.

Validation and standard data sets? Someone? Please?

Page 23: Scaling metagenome assembly

Lossless assembly; boosting.

Page 24: Scaling metagenome assembly

Acknowledgements:

The k-mer gang: Adina Howe, Jason Pell, Arend Hintze, Qingpeng Zhang, Rose Canino-Koning, Tim Brom.

mRNAseq: Likit Preeyanon, Alexis Pyrkosz, Hans Cheng, Billie Swalla, and Weiming Li.

HMM graph search: Jordan Fish, Qiong Wang, Jim Cole.

Great Prairie consortium: Jim Tiedje, Rachel Mackelprang, Susannah Tringe, Janet Jansson

Funding: USDA NIFA; MSU, startup and iCER; DOE; BEACON/NSF STC; Amazon Education.

Page 26: Scaling metagenome assembly
Page 27: Scaling metagenome assembly

Lumps!

Adina Howe

Page 29: Scaling metagenome assembly

Knots in the graph are caused by sequencing artifacts.

Page 30: Scaling metagenome assembly

Identifying the source of knots

Use a systematic traversal algorithm to identify highly-connected k-mers.

Removal of these k-mers (trimming) breaks up the knots.

Many, but not all, of these highly-connected k-mers are associated with high-abundance k-mers.
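
A minimal sketch of one way to flag highly-connected k-mers, assuming "highly connected" means that a bounded breadth-first traversal from the k-mer reaches more than some threshold number of other k-mers. The radius, the threshold, and the function names are illustrative assumptions, and an exact Python set stands in for the probabilistic graph.

```python
from collections import deque


def neighbors(kmer, kmer_set):
    """Yield k-mers in `kmer_set` that overlap `kmer` by k-1 bases."""
    for base in "ACGT":
        for cand in (kmer[1:] + base, base + kmer[:-1]):
            if cand != kmer and cand in kmer_set:
                yield cand


def neighborhood_size(start, kmer_set, radius):
    """Number of k-mers within `radius` steps of `start`, including itself."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        kmer, dist = frontier.popleft()
        if dist == radius:
            continue  # do not expand beyond the radius
        for nxt in neighbors(kmer, kmer_set):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return len(seen)


def highly_connected(kmer_set, radius, threshold):
    """k-mers whose local neighborhood is larger than `threshold`."""
    return {
        km for km in kmer_set
        if neighborhood_size(km, kmer_set, radius) > threshold
    }
```

Trimming reads at the k-mers this returns, then re-running partitioning, is the kind of step that would break up a knot.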

Page 31: Scaling metagenome assembly

Highly connected k-mers are position-dependent

Adina Howe

Page 32: Scaling metagenome assembly

HCKs under-represented in assembly

Adina Howe

Page 33: Scaling metagenome assembly

HCKs tend to end contigs

Adina Howe

Page 34: Scaling metagenome assembly

Our current model

1) Contigs are extended or joined around artifacts, with an observation bias towards such extensions (because of length cutoff).

2) Tendency is for a long contig to be extended by 1-2 reads, so artifacts trend towards location at end of contig.

Adina Howe

Page 35: Scaling metagenome assembly

Conclusions (artifacts)

They connect lots of stuff (preferential attachment)

They result from something in the sequencing (3’ bias in reads)

Assemblers don’t like using them

The major effect of removing them is to shorten many contigs by a read.

Page 36: Scaling metagenome assembly

Digital normalization algorithm

for read in dataset:
    if median_kmer_count(read) < CUTOFF:
        update_kmer_counts(read)
        save(read)
    else:
        pass  # discard read
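
The pseudocode above can be fleshed out into a small runnable version. This is a sketch under stated assumptions: it uses an exact `collections.Counter` where the talk describes a constant-memory probabilistic counter, and the function name `normalize` and the default values of `k` and `cutoff` are illustrative.

```python
from collections import Counter
from statistics import median


def normalize(reads, k=20, cutoff=5):
    """Yield reads whose median k-mer count so far is below `cutoff`.

    Single pass: counts are updated only for reads that are kept, so
    redundant high-coverage reads are dropped as they stream by.
    """
    counts = Counter()  # assumption: exact counts, not a constant-memory sketch
    for read in reads:
        read_kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        if not read_kmers:
            continue  # read shorter than k: nothing to count
        if median(counts[km] for km in read_kmers) < cutoff:
            counts.update(read_kmers)
            yield read
        # otherwise: discard the read


reads = ["ACGTACGTAC"] * 10 + ["TTTTGGGGCC"]
kept = list(normalize(reads, k=4, cutoff=3))
print(len(kept), "of", len(reads), "reads kept")
```

Using the median rather than the minimum or mean makes the decision robust to a few erroneous k-mers in an otherwise well-covered read.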

Page 37: Scaling metagenome assembly

Supplemental: abundance filtering is very lossy.

[Figure: percent loss from abundance filtering (all >= 2), shown for contigs and bp across four groups: Total, 3.8x partition, 8.2x partition, Largest partition; axis: percentage lost, 0 to 100]

Page 38: Scaling metagenome assembly

Per-partition assembly optimization

Strategy: Vary k from 21 to 51, assemble with Velvet.

Choose k that maximizes sum(contigs > 1kb)

Ran top partitions in Iowa corn (4.2m reads, 303 partitions)

For k=33, 3.5 Mb in 1876 contigs > 1kb, max 15.7 kb

For best k for each partition (varied between 31 and 47), 5.7 Mb in 2511 contigs > 1kb, max 51.7 kb
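
The selection rule can be sketched as follows. The `assemble` argument is a hypothetical caller-supplied function that runs an assembler (for example Velvet) on one partition at a given k and returns the contig lengths; nothing here calls Velvet itself, and the odd-k range is an assumption based on the 21 to 51 span above.

```python
def score(contig_lengths, min_len=1000):
    """Total bases in contigs longer than `min_len` (the sum(contigs > 1kb) metric)."""
    return sum(n for n in contig_lengths if n > min_len)


def best_k(lengths_by_k, min_len=1000):
    """Pick the k whose assembly maximizes the score.

    `lengths_by_k` maps each k to the list of contig lengths obtained at that k.
    """
    return max(lengths_by_k, key=lambda k: score(lengths_by_k[k], min_len))


def choose_k(reads, assemble, ks=range(21, 52, 2)):
    """Assemble one partition at every k in `ks` and return the best k.

    `assemble(reads, k)` is hypothetical and must return contig lengths.
    """
    return best_k({k: assemble(reads, k) for k in ks})


example = {21: [500, 1200], 33: [1500, 2000, 900], 47: [3000]}
print(best_k(example))
```

Because each partition is scored on its own, different partitions are free to settle on different k values, as in the 31 to 47 spread reported above.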

Page 40: Scaling metagenome assembly

Comparing assemblies / dendrogram