spark summit europe: share and analyse genomic data at scale
Post on 15-Feb-2017
Share and analyse genomic data at scale
with Spark, ADAM, Tachyon & the Spark Notebook
by @DataFellas, Oct 29th, 2015
Outline
● Sharp intro to genomics data
● What are the challenges
● Distributed machine learning to the rescue
● Projects: distributed teams
● Research: a long process
● Towards maximum share for efficiency
Andy Petrella
Maths, Geospatial, Distributed Computing
Spark Notebook, Spark/Scala trainer, Machine Learning
Xavier Tordoir
Physics, Bioinformatics, Distributed Computing
Scala (& Perl), Spark trainer, Machine Learning
“There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)
Analyse Genomics at Scale: Spark, ADAM, Spark Notebook
➔ Sharp intro to genomics data
➔ What are the challenges
➔ Distributed machine learning to the rescue
What is genomics data? DNA?
What makes us what we are…
… a complex biochemical soup.
With applications to medical diagnostics, drug response, disease mechanisms
On the production side
Fast biotech progress…
… can IT keep up?
Sequence {A, T, G, C}
3 billion characters (bases)
… x 30 (x 60)
Massively parallel
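The back-of-envelope arithmetic behind these numbers can be sketched in a couple of lines of Scala (our illustration, not from the talk): a human genome is roughly 3 billion bases, and sequencing at 30x (or 60x) coverage means each position is read that many times on average.

```scala
// Raw sequencing output per genome, before alignment or compression.
def rawBases(genomeLength: Long, coverage: Int): Long =
  genomeLength * coverage

val bases30x = rawBases(3000000000L, 30) // 90 billion base calls
val bases60x = rawBases(3000000000L, 60) // 180 billion base calls
```

At one byte per base call (plus a quality score per base in formats like FASTQ), this is why a single sequenced genome already lands in the hundreds of gigabytes of raw data.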
Lots of data? Tens of millions…
Lots of data! Tens of millions, times 1,000s … 1,000,000s of genomes
ADAM: Spark genomics library
http://www.bdgenomics.org
Matt Massie, Frank Nothaft
Avro schema
Parquet storage
Genomics API
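To make the "Avro schema + Parquet storage" idea concrete, here is a deliberately simplified Avro record in the spirit of ADAM's bdg-formats schemas (the real AlignmentRecord and Genotype schemas are much richer; the names below are our own illustration):

```json
{
  "type": "record",
  "name": "SimpleVariant",
  "namespace": "example.genomics",
  "fields": [
    {"name": "contig",         "type": "string"},
    {"name": "position",       "type": "long"},
    {"name": "reference",      "type": "string"},
    {"name": "alternate",      "type": "string"},
    {"name": "sampleId",       "type": "string"},
    {"name": "altAlleleCount", "type": "int"}
  ]
}
```

Because the schema travels with the data, any team (or any language with an Avro binding) can read the records without out-of-band documentation, and Parquet can store them column by column for efficient scans.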
So what do we do with this? Study variations between populations
Descriptive statistics
Machine Learning (Population stratification or Supervised learning)
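As a minimal sketch of the descriptive-statistics side (our example, not from the talk): the per-site alternate-allele frequency, computed from diploid genotypes encoded as alternate-allele counts (0, 1 or 2), is the basic quantity used to standardise a genotype matrix before techniques like PCA for population stratification.

```scala
// Alternate-allele frequency at one variant site.
// Each genotype is the number of alternate alleles carried (0, 1 or 2),
// so a diploid sample contributes 2 alleles to the denominator.
def altAlleleFreq(genotypes: Seq[Int]): Double =
  genotypes.sum.toDouble / (2.0 * genotypes.size)
```

For example, genotypes `Seq(0, 1, 2, 1)` give 4 alternate alleles out of 8, i.e. a frequency of 0.5.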
… and share and replay!
The Spark Notebook … comes to the rescue.
+ Self-described and consistent
+ Easily shared (code)
+ Scala (types, production quality)
+ Reactive & pluggable charts API (Scala = no JS)
+ Easy install, no deps
+ Multiple SparkContexts
http://www.spark-notebook.io
So what do we do with this?
… and share and replay!
Code can be shared easily but we want more...
How do we share data produced by the notebook?
How do we publish the notebook as a service?
Share Genomics at Scale: Spark, Tachyon, Mesos, Shar3
➔ Projects: distributed teams
➔ Research: a long process
➔ Towards maximum share for efficiency
Projects
Intrinsically involving many teams,
geographically distributed across countries and laboratories,
with different skills in Biology, Genetics, IT, Medicine (, legal…)
Projects
Require many types of data, ranging from:
● bio samples
● imagery
● textual
● archives/historical
Projects
Of course, projects generally gather many people from several populations.
Note: this is very expensive and burns money and time like hell!
Projects
1,000 Genomes (2008–2012): 200 TB
100,000 Genomes (2013–2017): 20 PB (probably more)
1,000,000 genomes (2016–2020): 0.2 EB (probably more)
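The growth above follows directly from the per-genome footprint. Assuming roughly 200 GB of sequence data per genome (the figure implied by "1,000 genomes ≈ 200 TB"), a quick Scala sketch reproduces the slide's numbers:

```scala
// Cohort storage footprint, assuming ~200 GB per sequenced genome.
// The per-genome figure is a rough assumption, not a measured value.
def cohortGigabytes(genomes: Long, perGenomeGB: Long = 200L): Long =
  genomes * perGenomeGB

val gb1k   = cohortGigabytes(1000L)    // 200,000 GB      = 200 TB
val gb100k = cohortGigabytes(100000L)  // 20,000,000 GB   = 20 PB
val gb1m   = cohortGigabytes(1000000L) // 200,000,000 GB  = 0.2 EB
```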
eQTL: mixing many sources
Projects
Need proper data management between entities, yet coping with:
● amount of data
● heterogeneity of people
● distance between actors
● constraints related to data location
Projects
Distributed-friendly: SCHEMAS + BINARY
e.g. Avro
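A toy Scala comparison of why "schemas + binary" travels better than text (our illustration; real Avro and Parquet add compression and columnar layout on top): a diploid genotype fits in a single byte, while the same value in CSV costs roughly two characters, a digit plus a separator.

```scala
// Size of a genotype vector in a naive one-byte-per-value binary
// encoding versus a plain CSV text encoding.
def binarySize(genotypes: Seq[Int]): Int =
  genotypes.size // one byte per genotype (values 0, 1 or 2)

def csvSize(genotypes: Seq[Int]): Int =
  genotypes.mkString(",").length // digit + comma per genotype
```

With a schema describing the layout, the binary form stays self-describing across teams, while nearly halving the bytes on the wire even before compression.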
Research
Research in medicine, or in health in general, is
LOOOOOOO…OOOOONG
Research
Most reasons are quite obvious and must not be overlooked:
Lots of measures and validation
Lots of controls (including by governments)
Lots of actors
Research
As a matter of fact, research needs to be conducted on data and to produce results.
And both are extremely exposed to reuse. So what if we lose either of them?
Research
However, we can get into trouble instantly, without even losing them!
What if we don’t track the processes?
In any scientific process, confrontation, replay and enhancement are key to moving forward.
It is misleading to think that sharing the code is enough.
Remember: we are looking for data and results, not for code.
The process includes the code, the context, the sources and so on, and all should be part of the data discovery/validation task
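One way to make "the process includes the code, the context, the sources" tangible is a minimal provenance record kept alongside every produced dataset (an illustrative sketch, not something from the talk; all field names are our assumptions):

```scala
// A minimal provenance record: enough to replay or audit a result
// years later by pointing back at the exact code, inputs and context.
case class Provenance(
  codeVersion: String,             // e.g. a git commit hash
  inputs: Seq[String],             // URIs of the source datasets
  parameters: Map[String, String], // run-time context of the analysis
  producedAt: Long                 // epoch millis
)

val run = Provenance(
  codeVersion = "3f9c2ab",
  inputs = Seq("adam://cohort/chr22"),
  parameters = Map("maf" -> "0.01"),
  producedAt = 1446076800000L
)
```

Attaching such a record to each output is what turns "we still have the data" into "we can still reproduce the result".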
Research
Assess the risk factor associated with a disease given mutations of a certain gene.
More than 50 years of data collecting and modelling.
Hundreds of researchers, each generation has new ideas.
Replaying old processes on new data, new processes on old data
Research
Share share share
All these facts relate to our capacity to share our work and to collaborate.
We need to share efficiently and accurately the
★ data
★ processes
★ results
Share share share
The challenge resides in the workflow
Share share share
The workflow, with the roles involved at each step (ops, data, sci, web):
➔ “Create” cluster (ops)
➔ Find sources (context, quality, semantic, …) (data)
➔ Connect to sources (structure, schema/types, …) (ops, data)
➔ Create distributed data pipeline/model (sci)
➔ Tune accuracy (sci, ops)
➔ Tune performance (sci)
➔ Write results to sinks (ops, data)
➔ Access layer (web, ops, data)
➔ User access (web, ops, data, sci)
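The workflow slide can be read as a small data model: each step in the pipeline lifecycle tagged with the roles it involves (the step-to-role mapping below is our interpretation of the slide):

```scala
// A pipeline-lifecycle step and the roles it involves.
case class Step(name: String, roles: Set[String])

val workflow = Seq(
  Step("Create cluster",          Set("ops")),
  Step("Find sources",            Set("data")),
  Step("Connect to sources",      Set("ops", "data")),
  Step("Create pipeline/model",   Set("sci")),
  Step("Tune accuracy",           Set("sci", "ops")),
  Step("Tune performance",        Set("sci")),
  Step("Write results to sinks",  Set("ops", "data")),
  Step("Access layer",            Set("web", "ops", "data")),
  Step("User access",             Set("web", "ops", "data", "sci"))
)

// Every role appears somewhere: no single team owns the whole flow,
// which is exactly why sharing and collaboration are the bottleneck.
val allRoles = workflow.flatMap(_.roles).toSet
```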
Share share share
Streamlining the development lifecycle for better productivity with Shar3
Share share share
The Shar3 cycle: Analysis → Production → Distribution/Rendering → Discovery → Catalog
Project Generator · Micro Service / Binary format · Schema for output · Metadata
That’s all folks! Thanks for listening/staying.
Poke us on Twitter or via http://data-fellas.guru
@DataFellas @Shar3_Fellas @SparkNotebook
@Xtordoir & @Noootsab
Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas)
Check also @TypeSafe: http://t.co/o1Bt6dQtgH