experiencing apache spark in genomics - hpc advisory council · 2020-01-15 · experiencing apache...

Experiencing Apache Spark in Genomics

Zhong Wang, Ph.D.

Group Lead, Genome Analysis

02/20/2018

A metagenome is the genome of a microbial community

Microbial communities are “dark matter”

Number of species

• Cow: ~6,000
• Human: ~1,000
• Soil: >100,000

>90% of the species haven’t been seen before

Metagenome sequencing

Harvest microbes → Extract DNA → Shear & sequence → Assembly

Microbes → Genomic DNA → Short reads → Reconstructed genomes

Metagenome assembly

Library of books → Shredded library → “Reconstructed” library

Genome ~= Book

Metagenome ~= Library

Sequencing ~= sampling the pieces

Scale is an enemy

[Figure: data size in gigabases (Gb), log scale from 1 to 1,000,000; categories: Common, Human, Cow, Soil]

Complexity is another…

Remove contaminants, sequencing errors

Overlap graph / de Bruijn graph

Contigs or clusters

• Repetitive elements
• Homologous genes
• Horizontally transferred genes

The ideal solution and the failed ones

Ideal: easy to develop, robust, scales to big data, efficient

BigMem

• Easy to develop

• Expensive

• Does not scale

MPI

• Fast

• Hard to develop

• Not robust

Hadoop

• Easy to develop

• Scales

• Slow

Addressing big data: Apache Spark

• New scalable programming paradigm
• Compatible with Hadoop-supported storage systems
• Improves efficiency through:
    • In-memory computing primitives
    • General computation graphs
• Improves usability through:
    • Rich APIs in Java, Scala, Python
    • Interactive shell

Scale to big data

Efficient

Easy to develop

Robust?

Goal: Metagenome read clustering

Read clustering can reduce the metagenome problem to a single-genome problem

• Parallel Processing

• Individualized optimization

Reads → Read clusters

Algorithm

Node: read

Edge: number of k-mers two reads share

Read graph containing all reads → Graph partitioning: label propagation algorithm (LPA)
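The read graph described above can be illustrated with a minimal, single-machine sketch in plain Python. This is not the Spark implementation from the talk: the function names (`kmers`, `build_read_graph`), the `min_shared` threshold, and the toy reads are all assumptions made for illustration. Only the graph definition (nodes are reads, edge weight is the number of shared k-mers) comes from the slide.

```python
from collections import defaultdict
from itertools import combinations


def kmers(read, k):
    """Return the set of distinct k-mers (length-k substrings) in a read."""
    return {read[i:i + k] for i in range(len(read) - k + 1)}


def build_read_graph(reads, k, min_shared=1):
    """Build a weighted read graph (hypothetical helper, not the talk's code).

    Nodes are read indices. The weight of edge (i, j) is the number of
    distinct k-mers that reads i and j share. Edges below `min_shared`
    are dropped, which is one simple way to reduce the edge set.
    """
    # Map each k-mer to the set of reads that contain it.
    kmer_to_reads = defaultdict(set)
    for idx, read in enumerate(reads):
        for kmer in kmers(read, k):
            kmer_to_reads[kmer].add(idx)

    # Each k-mer contributes one count to every pair of reads containing it.
    shared = defaultdict(int)
    for read_ids in kmer_to_reads.values():
        for i, j in combinations(sorted(read_ids), 2):
            shared[(i, j)] += 1

    return {edge: n for edge, n in shared.items() if n >= min_shared}


# Toy reads: two overlapping pairs drawn from unrelated sequences.
reads = ["ACGTACGT", "CGTACGTT", "TTTTGGGG", "TTGGGGCC"]
graph = build_read_graph(reads, k=4, min_shared=2)
print(graph)  # {(0, 1): 4, (2, 3): 3}
```

Inverting the k-mer index first avoids comparing every pair of reads directly. The pairwise expansion per k-mer can still be large for very common k-mers, which is one reason an edge-reduction step matters at scale.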

K-mer-mapping reads (KMR) → Graph construction and edge reduction (Edges) → LPA
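The final LPA step can be sketched as a small weighted label propagation routine in plain Python. The slides name LPA but give no details, so the specifics here are assumptions: in-place (asynchronous) updates, edge weights used as votes, ties broken toward the smallest label, and the hypothetical function name `label_propagation`.

```python
from collections import defaultdict


def label_propagation(edges, num_nodes, max_iters=20):
    """Cluster the nodes of a weighted undirected graph by label propagation.

    `edges` maps (i, j) pairs to weights. Every node starts with its own
    label and repeatedly adopts the label carrying the largest total edge
    weight among its neighbours. Ties go to the smallest label so that
    repeated runs give the same result.
    """
    neighbours = defaultdict(dict)
    for (i, j), w in edges.items():
        neighbours[i][j] = w
        neighbours[j][i] = w

    labels = list(range(num_nodes))
    for _ in range(max_iters):
        changed = False
        for node in range(num_nodes):
            if not neighbours[node]:
                continue  # isolated nodes keep their own label
            votes = defaultdict(int)
            for other, w in neighbours[node].items():
                votes[labels[other]] += w
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break  # labels are stable
    return labels


# Toy read graph: reads 0-2 are linked, reads 3-4 are linked, read 5 is isolated.
edges = {(0, 1): 4, (1, 2): 3, (3, 4): 5}
labels = label_propagation(edges, num_nodes=6)
print(labels)  # [1, 1, 1, 4, 4, 5]
```

Reads that end with the same label form one cluster, so the toy graph yields three clusters: {0, 1, 2}, {3, 4}, and {5}.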

Testing datasets

                      Human Alzheimer transcriptome   Cow rumen metagenome
Data type             Transcriptome                   Metagenome
# species             ~20,000                         >=10,000
Repetitive content    medium                          high
# known species       high                            low
Read type             PacBio                          Illumina
Read length (bases)   0.3-30,000                      2x150
# reads               2 million                       1.2 billion
Data size             7.6 GB                          1 TB

High accuracy on a controlled dataset

Hardware and software environments

                              OTC         EMR         Bridge
Nodes                         20          20          8
Cores per node (total)        8 (160)     8 (160)     28 (224)
Memory per node, GB (total)   64 (1280)   61 (1220)   128 (1024)
Hadoop version                2.7.3       2.7.3       2.7.2
Spark version                 2.1.1       2.2.0       2.1.0

Cow rumen: scale up to big data

[Figure: execution time (mins) vs. data size (GB, 20 to 100) on OTC; series: KMR, Edges, LPA, Total]

Increasing nodes

[Figure: 50G Cow Rumen on EMR; execution time (mins) vs. number of nodes (25 to 100); series: KMR, Edges, LPA, Total]

[Figure: 10G Cow Rumen on EMR; execution time (mins) vs. number of nodes (5 to 20); series: KMR, Edges, LPA, Total]

Fine tune parallelism

[Figure: execution time (mins) vs. Spark default parallelism (log10); series: 50G, 20G]

Dataset complexity vs performance

[Figure: execution time (mins), stacked by stage (KMR, Edges, LPA), for Human Iso-Seq Alzheimer (PacBio) and Cow Rumen (Illumina); value labels: 146.33, 44.5]

Platform comparison: Cloud vs HPC

                              OTC         EMR         Bridge
Nodes                         20          20          8
Cores per node (total)        8 (160)     8 (160)     28 (224)
Memory per node, GB (total)   64 (1280)   61 (1220)   128 (1024)
Time (min)                    106         105         126
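One way to read the platform comparison is to normalize run time by core count. The short Python sketch below computes core-minutes (total cores multiplied by wall-clock minutes) from the figures in the table. Core-minutes is an illustrative metric chosen here, not one used in the talk, and it ignores differences in memory, CPU generation, and interconnect.

```python
# Figures copied from the platform comparison table.
platforms = {
    "OTC": {"nodes": 20, "cores_per_node": 8, "time_min": 106},
    "EMR": {"nodes": 20, "cores_per_node": 8, "time_min": 105},
    "Bridge": {"nodes": 8, "cores_per_node": 28, "time_min": 126},
}


def core_minutes(platform):
    """Total cores multiplied by wall-clock minutes (illustrative metric)."""
    return platform["nodes"] * platform["cores_per_node"] * platform["time_min"]


usage = {name: core_minutes(p) for name, p in platforms.items()}
for name, value in usage.items():
    print(f"{name}: {value} core-minutes")
# OTC: 16960 core-minutes
# EMR: 16800 core-minutes
# Bridge: 28224 core-minutes
```

By this rough measure the two cloud clusters are nearly identical, while the HPC run uses more cores for a somewhat longer time.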

Overall impression of Spark

✓ Easy to develop
✓ Robust
✓ Scale to big data
✓ Flexible (cloud, HPC)
? Efficient
    ✓ vs. Hadoop/Pig
    ▪ vs. MPI?
? Accuracy
    ✓ Long reads
    ▪ Short reads need optimization

Acknowledgements

Spark Team

Lizhen Shi @ FSU

Xiandong Meng

Lisa Gerhardt, Evan Racah @ NERSC

Yong Qin, Gary Jung, Greg Kurtzer, Bernard Li @ HPC

Philip Blood, Bryon Gill @ PSC
