experiencing apache spark in genomics - hpc advisory council · 2020-01-15 · experiencing apache...
![Page 1: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/1.jpg)
Experiencing Apache Spark in Genomics
Zhong Wang, Ph.D.
Group Lead, Genome Analysis
02/20/2018
![Page 2: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/2.jpg)
A metagenome is the genome of a microbial community
![Page 3: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/3.jpg)
Microbial communities are “dark matter”
Number of species:
• Cow: ~6,000
• Human: ~1,000
• Soil: >100,000
>90% of the species haven’t been seen before
![Page 4: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/4.jpg)
Metagenome sequencing
Steps: Harvest microbes → Extract DNA → Shear & sequence → Assembly
Products: Microbes → Genomic DNA → Short reads → Reconstructed genomes
![Page 5: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/5.jpg)
Metagenome assembly
Library of books → Shredded library → “Reconstructed” library
Genome ~= Book
Metagenome ~= Library
Sequencing ~= sampling the pieces
![Page 6: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/6.jpg)
Scale is an enemy
[Figure: data size in gigabases (Gb), log scale from 1 to 1,000,000, for Common, Human, Cow, and Soil samples]
![Page 7: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/7.jpg)
Complexity is another…
Remove contaminants, sequencing errors
Overlap graph, de Bruijn graph
Contigs or clusters
• Repetitive elements
• Homologous genes
• Horizontally transferred genes
![Page 8: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/8.jpg)
The ideal solution and the failed ones
Easy to develop · Robust · Scale to big data · Efficient
BigMem
• Easy to develop
• Expensive
• Does not scale
MPI
• Fast
• Hard to develop
• Not robust
Hadoop
• Easy to develop
• Scale
• Slow
![Page 9: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/9.jpg)
Addressing big data: Apache Spark
• New scalable programming paradigm
• Compatible with Hadoop-supported storage systems
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Java, Scala, Python
  • Interactive shell
Scale to big data
Efficient
Easy to develop
Robust?
![Page 10: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/10.jpg)
Goal: Metagenome read clustering
Read clustering can reduce the metagenome problem to a single-genome problem
• Parallel processing
• Individualized optimization
Reads → Read clusters
![Page 11: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/11.jpg)
Algorithm
Read graph containing all reads
• Node: read
• Edge: number of k-mers two reads share
Steps:
1. Kmer-mapping reads (KMR)
2. Graph Construction and Edge Reduction (Edges)
3. Graph Partitioning: LPA
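The three stages named on this slide (KMR, Edges, LPA) can be illustrated with a minimal single-machine sketch in plain Python. It is not the Spark implementation from the talk: it assumes LPA means label propagation, and every function name, the k-mer length `k`, and the edge threshold `min_shared` are hypothetical choices made only for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of a read."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_to_reads(reads, k):
    """Stage 1 (KMR): map each k-mer to the ids of the reads containing it."""
    index = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for kmer in kmers(seq, k):
            index[kmer].add(read_id)
    return index


def build_edges(index, min_shared):
    """Stage 2 (Edges): count k-mers shared by each read pair, drop weak edges."""
    shared = Counter()
    for read_ids in index.values():
        for a, b in combinations(sorted(read_ids), 2):
            shared[(a, b)] += 1
    # Edge reduction here is a simple threshold (an assumption of this sketch).
    return {pair: n for pair, n in shared.items() if n >= min_shared}


def label_propagation(n_reads, edges, max_iter=20):
    """Stage 3 (LPA): each read adopts the heaviest label among its neighbours."""
    neighbours = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbours[a][b] = weight
        neighbours[b][a] = weight
    labels = list(range(n_reads))  # every read starts in its own cluster
    for _ in range(max_iter):
        changed = False
        for node in range(n_reads):
            votes = Counter()
            for other, weight in neighbours.get(node, {}).items():
                votes[labels[other]] += weight
            if not votes:
                continue  # isolated read keeps its own label
            # Heaviest label wins; ties go to the smallest label for determinism.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if best != labels[node]:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


def cluster_reads(reads, k, min_shared):
    """Run the three stages and return one cluster label per read."""
    index = kmer_to_reads(reads, k)
    edges = build_edges(index, min_shared)
    return label_propagation(len(reads), edges)


# Toy input: reads 0 and 1 overlap each other, as do reads 2 and 3.
reads = ["ACGTACGTGG", "CGTACGTGGA", "TTTTCCCCAA", "TTTCCCCAAG"]
labels = cluster_reads(reads, k=4, min_shared=2)
print(labels)
```

On this toy input, reads 0 and 1 end up with one label and reads 2 and 3 with another. In a distributed setting each stage would presumably become a keyed aggregation over partitions, which is the part this sketch leaves out.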
![Page 12: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/12.jpg)
Testing datasets
| | Human Alzheimer Transcriptome | Cow Rumen metagenome |
|---|---|---|
| Data type | Transcriptome | Metagenome |
| # species | ~20,000 | >=10,000 |
| Repetitive content | medium | high |
| # known species | high | low |
| Read type | PacBio | Illumina |
| Read length (bases) | 0.3-30,000 | 2x150 |
| # reads | 2 million | 1.2 billion |
| Data size | 7.6 GB | 1 TB |
![Page 13: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/13.jpg)
High accuracy on a controlled dataset
![Page 14: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/14.jpg)
Hardware and software environments
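| | OTC | EMR | Bridge |
|---|---|---|---|
| nodes | 20 | 20 | 8 |
| cores per node (total) | 8 (160) | 8 (160) | 28 (224) |
| memory per node (total) | 64 (1280) | 61 (1220) | 128 (1024) |
| Hadoop | 2.7.3 | 2.7.3 | 2.7.2 |
| Spark | 2.1.1 | 2.2.0 | 2.1.0 |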
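The parenthesized values in the cores and memory rows appear to be cluster-wide totals, that is, the per-node figure multiplied by the node count. The slide does not state this, nor the memory unit, so the short check below treats both as assumptions; the dictionary keys are hypothetical names.

```python
# Per-node figures copied from the table; totals are derived, not quoted.
clusters = {
    "OTC": {"nodes": 20, "cores_per_node": 8, "memory_per_node": 64},
    "EMR": {"nodes": 20, "cores_per_node": 8, "memory_per_node": 61},
    "Bridge": {"nodes": 8, "cores_per_node": 28, "memory_per_node": 128},
}


def totals(spec):
    """Return (total cores, total memory) for one cluster specification."""
    nodes = spec["nodes"]
    return nodes * spec["cores_per_node"], nodes * spec["memory_per_node"]


for name, spec in clusters.items():
    cores, memory = totals(spec)
    # The memory unit is not given on the slide, so none is printed here.
    print(f"{name}: {cores} cores, {memory} memory in total")
```

The products match the parenthesized numbers for all three platforms, which supports the per-node reading.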
![Page 15: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/15.jpg)
Cow rumen: scale up to big data
[Figure: execution time (mins, axis 0 to 800) vs. data size (GB: 20, 40, 60, 80, 100) on OTC; series: KMR, Edges, LPA, Total]
![Page 16: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/16.jpg)
Increasing nodes
[Figure, panel 1: 50G Cow Rumen on EMR; execution time (mins, axis 0 to 500) vs. number of nodes (25, 50, 75, 100); series: KMR, Edges, LPA, Total]
[Figure, panel 2: 10G Cow Rumen on EMR; execution time (mins, axis 0 to 160) vs. number of nodes (5, 10, 15, 20); series: KMR, Edges, LPA, Total]
![Page 17: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/17.jpg)
Fine tune parallelism
[Figure: execution time (mins, axis 0 to 350) vs. Spark default parallelism (log10, ticks 1 to 8); series: 50G, 20G]
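The axis label most likely refers to the Spark property `spark.default.parallelism`, which sets the default number of partitions for RDD operations such as `join` and `reduceByKey` when the user does not specify one. As a minimal sketch of how such a sweep could be launched, the helper below builds `spark-submit` command lines; the application file name and the parallelism values are hypothetical and are not taken from the talk.

```python
def spark_submit_command(app, parallelism):
    """Build a spark-submit argument list with an explicit default parallelism."""
    if parallelism < 1:
        raise ValueError("parallelism must be a positive integer")
    return [
        "spark-submit",
        "--conf", f"spark.default.parallelism={parallelism}",
        app,
    ]


# Hypothetical sweep over powers of ten; "cluster_reads.py" is a placeholder name.
for exponent in (2, 3, 4):
    print(" ".join(spark_submit_command("cluster_reads.py", 10 ** exponent)))
```

Each printed line can be run as-is on a machine with Spark installed, which makes it easy to time the same job at several settings.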
![Page 18: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/18.jpg)
Dataset complexity vs performance
[Figure: stacked bars of execution time (mins, axis 0 to 160) by stage (KMR, Edges, LPA) for Human Iso-Seq Alzheimer (PacBio) and Cow Rumen (Illumina); data labels: 146.33 and 44.5]
![Page 19: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/19.jpg)
Platform comparison: Cloud vs HPC
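| | OTC | EMR | Bridge |
|---|---|---|---|
| nodes | 20 | 20 | 8 |
| cores per node (total) | 8 (160) | 8 (160) | 28 (224) |
| memory per node (total) | 64 (1280) | 61 (1220) | 128 (1024) |
| Time (min) | 106 | 105 | 126 |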
![Page 20: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/20.jpg)
Overall impression of Spark
✓ Easy to develop
✓ Robust
✓ Scale to big data
✓ Flexible (cloud, HPC)
? Efficient
  ✓ vs. Hadoop/PIG
  ▪ vs. MPI?
? Accuracy
  ✅ Long reads
  ▪ Short reads need optimization
![Page 21: Experiencing Apache Spark in Genomics - HPC Advisory Council · 2020-01-15 · Experiencing Apache Spark in Genomics Zhong Wang, Ph.D. Group Lead, Genome Analysis 02/20/2018](https://reader034.vdocuments.mx/reader034/viewer/2022050121/5f515b83e5f918157102d2b1/html5/thumbnails/21.jpg)
Acknowledgements
Spark Team
Lizhen Shi @ FSU
Xiandong Meng
Lisa Gerhardt, Evan Racah @ NERSC
Yong Qin, Gary Jung, Greg Kurtzer, Bernard Li @ HPC
Philip Blood, Bryon Gill @ PSC