![Page 1: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/1.jpg)
Share and analyse genomic data at scale
with Spark, Adam, Tachyon & the Spark Notebook
by @DataFellas, Oct • 29th • 2015
![Page 2: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/2.jpg)
Outline
● Sharp intro to Genomics data
● What are the Challenges
● Distributed Machine Learning to the rescue
● Projects: Distributed teams
● Research: Long process
● Towards Maximum Share for efficiency
![Page 3: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/3.jpg)
Andy Petrella
Maths, Geospatial, Distributed Computing
Spark Notebook, Trainer Spark/Scala, Machine Learning
Xavier Tordoir
Physics, Bioinformatics, Distributed Computing
Scala (& Perl), Trainer Spark, Machine Learning
“There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)
![Page 4: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/4.jpg)
Analyse Genomic At Scale
Spark, Adam, Spark Notebook
➔ Sharp intro to Genomics data
➔ What are the Challenges
➔ Distributed Machine Learning to the rescue
![Page 5: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/5.jpg)
What is genomics data?
DNA?
What makes us what we are…
… a complex biochemical soup.
With applications to medical diagnostics, drug response, disease mechanisms
![Page 7: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/7.jpg)
On the production side
Fast biotech progress…
… can IT keep up?
![Page 8: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/8.jpg)
On the production side
Sequence {A, T, G, C}
3 billion characters (bases)
![Page 9: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/9.jpg)
On the production side
Sequence {A, T, G, C}
3 billion characters (bases)
… x 30 (x 60)
Massively parallel
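A rough back-of-the-envelope sketch of what these numbers imply for a single genome. It reads the slide's "x 30 (x 60)" as sequencing coverage, and it assumes (hypothetically) one byte per base with no quality scores or metadata, so real files will be larger. The function names are illustrative only.

```python
# Back-of-the-envelope estimate of raw sequencing volume for one genome.
GENOME_BASES = 3_000_000_000  # about 3 billion bases, as on the slide


def sequenced_bases(coverage: int) -> int:
    """Total bases read when every position is covered `coverage` times on average."""
    return GENOME_BASES * coverage


def approx_gigabytes(bases: int, bytes_per_base: float = 1.0) -> float:
    """Rough size in GB, assuming one byte per base (an assumption, not a file format)."""
    return bases * bytes_per_base / 1e9


for coverage in (30, 60):
    bases = sequenced_bases(coverage)
    print(f"{coverage}x coverage: {bases:,} bases, ~{approx_gigabytes(bases):.0f} GB")
```

Under these assumptions a single genome at 30x already amounts to roughly 90 GB of raw bases, which is why the reads are processed in parallel.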
![Page 10: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/10.jpg)
Lots of data?
![Page 11: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/11.jpg)
Lots of data?
Tens of millions
![Page 12: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/12.jpg)
Lots of data!
Tens of millions
1,000s, 1,000,000s...
![Page 13: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/13.jpg)
ADAM: Spark genomics library
http://www.bdgenomics.org
Matt Massie, Frank Nothaft
![Page 14: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/14.jpg)
ADAM: Spark genomics library
![Page 15: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/15.jpg)
ADAM: Spark genomics library
![Page 16: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/16.jpg)
ADAM: Spark genomics library
![Page 17: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/17.jpg)
ADAM: Spark genomics library
Avro schema
Parquet storage
Genomics API
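To illustrate what "Avro schema" means in practice, here is a minimal sketch of a record described in Avro's JSON schema syntax, with a tiny hand-written conformance check. The `ExampleVariant` record and its fields are hypothetical and are not ADAM's actual schema; the `conforms` helper is a stand-in for what a real Avro library would do.

```python
import json

# Hypothetical schema written in Avro's JSON schema syntax.
# This is NOT ADAM's actual schema, only an illustration of the idea.
VARIANT_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "ExampleVariant",
  "fields": [
    {"name": "contig",    "type": "string"},
    {"name": "position",  "type": "long"},
    {"name": "reference", "type": "string"},
    {"name": "alternate", "type": "string"}
  ]
}
""")

# Mapping from the Avro primitive type names used above to Python types.
PYTHON_TYPES = {"string": str, "long": int}


def conforms(record: dict, schema: dict) -> bool:
    """Return True if `record` has exactly the schema's fields with matching types."""
    fields = {f["name"]: PYTHON_TYPES[f["type"]] for f in schema["fields"]}
    if set(record) != set(fields):
        return False
    return all(isinstance(record[name], py_type) for name, py_type in fields.items())


good = {"contig": "chr1", "position": 12345, "reference": "A", "alternate": "G"}
bad = {"contig": "chr1", "position": "12345", "reference": "A", "alternate": "G"}
print(conforms(good, VARIANT_SCHEMA))
print(conforms(bad, VARIANT_SCHEMA))
```

The point of the sketch is that the schema is itself data: it can be shipped alongside the records, so every team reads the same fields with the same types.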
![Page 18: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/18.jpg)
So what do we do with this? Study variations between populations
Descriptive statistics
Machine Learning (Population stratification or Supervised learning)
… and share and replay!
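As a small illustration of the "descriptive statistics" step, the sketch below computes the alternate allele frequency per population at a single variant site. The data is made up and the code is plain Python; on a cluster the same grouping and summing would be expressed as a distributed aggregation, which is not shown here.

```python
from collections import defaultdict

# Toy genotypes: (sample, population, count of alternate alleles at one variant site).
# For a diploid sample the count is 0, 1 or 2. All values are made up for illustration.
GENOTYPES = [
    ("s1", "popA", 0),
    ("s2", "popA", 1),
    ("s3", "popA", 2),
    ("s4", "popB", 0),
    ("s5", "popB", 0),
    ("s6", "popB", 1),
]


def allele_frequencies(genotypes):
    """Alternate allele frequency per population: alt alleles / (2 * samples)."""
    alt = defaultdict(int)
    samples = defaultdict(int)
    for _sample, population, alt_count in genotypes:
        alt[population] += alt_count
        samples[population] += 1
    return {pop: alt[pop] / (2 * samples[pop]) for pop in samples}


freqs = allele_frequencies(GENOTYPES)
for population, freq in sorted(freqs.items()):
    print(f"{population}: alternate allele frequency = {freq:.3f}")
```

Differences in such frequencies across many sites are the kind of signal that population stratification methods work from.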
![Page 19: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/19.jpg)
The Spark Notebook … comes to the rescue.
+ Self described and consistent
+ Easily shared (code)
+ Scala (types, production quality)
+ Reactive & pluggable charts API (scala = no.js)
+ Easy install, no deps.
+ Multiple sparkContext
http://www.spark-notebook.io
![Page 20: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/20.jpg)
The Spark Notebook
![Page 21: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/21.jpg)
The Spark Notebook
![Page 22: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/22.jpg)
The Spark Notebook
![Page 23: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/23.jpg)
So what do we do with this?
… and share and replay!
Code can be shared easily but we want more...
How do we share data produced by the notebook?
How do we publish the notebook as a service?
![Page 24: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/24.jpg)
Share Genomic At Scale
Spark, Tachyon, Mesos, Shar3
➔ Projects: Distributed teams
➔ Research: Long process
➔ Towards Maximum Share for efficiency
![Page 25: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/25.jpg)
Projects
Intrinsically involving many teams
geographically distributed across different countries or laboratories
with different skills in Biology, Genetics, I.T., Medicine (legal, ...)
![Page 26: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/26.jpg)
Projects
Require many types of data, ranging from:
bio samples
imagery
textual
archives/historical
![Page 27: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/27.jpg)
Projects
Of course
Generally gather many people from several populations
Note: This is very expensive and burns $time as hell!
![Page 28: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/28.jpg)
Projects
1,000 genomes (2008-2012): 200 TB
100,000 genomes (2013-2017): 20 PB (probably more)
1,000,000 genomes (2016-2020): 0.2 EB (probably more)
eQTL: mixing many sources
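The project sizes quoted above follow a simple linear scaling, which the short check below makes explicit. It only uses the figures from the slide, converted to decimal units (1 PB = 1,000 TB, 1 EB = 1,000 PB); the helper name is illustrative.

```python
# Sizes quoted on the slide, in terabytes (decimal units).
PROJECTS_TB = {
    1_000: 200,          # 200 TB
    100_000: 20_000,     # 20 PB
    1_000_000: 200_000,  # 0.2 EB
}


def gigabytes_per_genome(genomes: int, total_tb: float) -> float:
    """Average storage per genome in GB implied by a project's total size."""
    return total_tb * 1_000 / genomes


for genomes, total_tb in PROJECTS_TB.items():
    per_genome = gigabytes_per_genome(genomes, total_tb)
    print(f"{genomes:>9,} genomes: {total_tb:>7,} TB total, {per_genome:.0f} GB per genome")
```

All three figures imply the same average of about 200 GB per genome, so the "probably more" on the slide is a statement about that per-genome cost growing, not about the arithmetic.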
![Page 29: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/29.jpg)
Projects
Need proper data management between entities, yet coping with:
amount of data
heterogeneity of people
distance between actors
constraints related to data location
![Page 30: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/30.jpg)
Projects
Distributed friendly
SCHEMAS + BINARY
e.g. Avro
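A minimal sketch of why "schema + binary" is friendly to distributed work: when writer and reader agree on a layout up front, field names and punctuation do not have to travel with every record. The layout below uses Python's `struct` module and is a hypothetical fixed format chosen for illustration; it is not Avro's wire format.

```python
import json
import struct

# Hypothetical fixed layout shared by writer and reader (the "schema"):
# 4-byte contig id, 8-byte position, 1-byte reference base, 1-byte alternate base.
# "<" means little-endian with no padding. This is NOT Avro's wire format.
LAYOUT = struct.Struct("<iqcc")


def encode(contig_id: int, position: int, ref: str, alt: str) -> bytes:
    """Pack one record into the agreed binary layout."""
    return LAYOUT.pack(contig_id, position, ref.encode("ascii"), alt.encode("ascii"))


def decode(payload: bytes) -> tuple:
    """Unpack one record; only possible because the layout is known in advance."""
    contig_id, position, ref, alt = LAYOUT.unpack(payload)
    return contig_id, position, ref.decode("ascii"), alt.decode("ascii")


record = (1, 123456789, "A", "G")
binary = encode(*record)
text = json.dumps(
    {"contig_id": 1, "position": 123456789, "reference": "A", "alternate": "G"}
).encode("utf-8")

print(len(binary), "bytes as schema-driven binary")
print(len(text), "bytes as self-describing JSON")
print(decode(binary) == record)
```

The binary form is several times smaller here because the structure lives in the shared schema rather than in each record, a saving that adds up over billions of records.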
![Page 31: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/31.jpg)
Research
Research in medicine or health in general is
LOOOOOOO…OOOOONG
![Page 32: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/32.jpg)
Research
Most reasons are quite obvious and must not be overlooked
Lots of measures and validation
Lots of control (including by Gov.)
Lots of actors
![Page 33: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/33.jpg)
Research
As a matter of fact, research needs
to be conducted on data and to produce results
And both are extremely exposed to reuse
So what if we lose either of them?
![Page 34: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/34.jpg)
Research
However, we can get into trouble instantly
without even losing them!
What if we don’t track the processes?
In any scientific process: confrontation, replay and enhancement are key to moving forward
![Page 35: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/35.jpg)
Research
It is misleading to think that sharing the code is enough.
Remember: we look for data and results, not for code.
The process includes the code, the context, the sources and so on, and all should be part of the data discovery/validation task
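One way to picture "the process includes the code, the context, the sources" is a small provenance record stored next to each result. The sketch below is an assumption about what such a record could contain, not a description of how the Spark Notebook or Shar3 track processes; all names are hypothetical.

```python
import hashlib
import json


def fingerprint(text: str) -> str:
    """SHA-256 hex digest of a piece of text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def provenance_record(code: str, sources: list, context: dict, result: str) -> dict:
    """Bundle what is needed to replay or audit one analysis run."""
    return {
        "code_sha256": fingerprint(code),
        "sources": sorted(sources),  # sorted so that listing order does not matter
        "context": context,          # e.g. library versions, parameters
        "result_sha256": fingerprint(result),
    }


def record_id(record: dict) -> str:
    """Stable identifier: the same process on the same inputs gives the same id."""
    return fingerprint(json.dumps(record, sort_keys=True))


run = provenance_record(
    code="freqs = allele_frequencies(genotypes)",
    sources=["cohort-b.parquet", "cohort-a.parquet"],  # hypothetical file names
    context={"spark": "x.y", "min_quality": 30},       # hypothetical context values
    result="popA=0.5;popB=0.167",
)
print(record_id(run))
```

With such an identifier, "replaying an old process on new data" becomes a comparison of records: same code fingerprint, different sources, and a result that can be checked against the earlier one.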
![Page 36: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/36.jpg)
Research
Assess the risk factor associated with a disease given mutations of a certain gene.
More than 50 years of data collection and modelling.
Hundreds of researchers, each generation has new ideas.
Replaying old processes on new data, new processes on old data
![Page 37: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/37.jpg)
Share share share
All these facts relate to our capacity to share our work and to collaborate.
We need to share efficiently and accurately the
★ data
★ processes
★ results
![Page 38: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/38.jpg)
Share share share
The challenge resides in the workflow
![Page 39: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/39.jpg)
Share share share
“Create” Cluster
Find sources (context, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access
ops
data
ops data
sci
sci ops
sci
ops data
web ops data
web ops data sci
![Page 41: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/41.jpg)
Share share share
Streamlining development lifecycle for better Productivity with Shar3
![Page 42: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/42.jpg)
Share share share
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Project Generator
Micro Service / Binary format
Schema for output
Metadata
![Page 43: Spark Summit Europe: Share and analyse genomic data at scale](https://reader031.vdocuments.mx/reader031/viewer/2022030211/58a40e9d1a28ab7d758b598b/html5/thumbnails/43.jpg)
That’s all folks
Thanks for listening/staying
Poke us on Twitter or via http://data-fellas.guru
@DataFellas @Shar3_Fellas @SparkNotebook
@Xtordoir & @Noootsab
Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas)
Check also @TypeSafe: http://t.co/o1Bt6dQtgH