Spark Summit Europe: Share and analyse genomic data at scale

Share and analyse genomic data at scale
with Spark, ADAM, Tachyon & the Spark Notebook
by @DataFellas, Oct 29th, 2015

Outline
● Sharp intro to Genomics data
● What are the Challenges
● Distributed Machine Learning to the rescue

● Projects: Distributed teams
● Research: Long process
● Towards Maximum Share for efficiency

Andy Petrella

Maths, Geospatial, Distributed Computing

Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir

Physics, Bioinformatics, Distributed Computing

Scala (& Perl), Trainer Spark, Machine Learning

“There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)

Analyse Genomic At Scale
Spark, ADAM, Spark Notebook

➔ Sharp intro to Genomics data
➔ What are the Challenges
➔ Distributed Machine Learning to the rescue

What is genomics data?
DNA?

What makes us what we are…

… a complex biochemical soup.

With applications to medical diagnostics, drug response, disease mechanisms

On the production side

Fast biotech progress…

… can IT keep up?


Sequence {A, T, G, C}

3 billion characters (bases)


… x 30 (x 60)
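
A rough back-of-the-envelope sketch of what the coverage factor means for raw volume. The one-byte-per-base figure is an assumption made only for illustration: real formats also store quality scores and are compressed.

```python
GENOME_BASES = 3_000_000_000  # about 3 billion bases, as stated on the slide


def sequenced_bases(coverage: int, genome_bases: int = GENOME_BASES) -> int:
    """Total bases read when each position is sequenced `coverage` times on average."""
    return genome_bases * coverage


for coverage in (30, 60):
    bases = sequenced_bases(coverage)
    # Assumption: 1 byte per base, ignoring quality scores and compression.
    print(f"{coverage}x coverage: {bases:.1e} bases, ~{bases / 1e9:.0f} GB at 1 byte/base")
```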

Massively parallel

Lots of data?

10s of millions

Lots of data!

1,000s
1,000,000s...

ADAM: Spark genomics library

http://www.bdgenomics.org

Matt Massie, Frank Nothaft




Avro schema

Parquet storage

Genomics API

So what do we do with this? Study variations between populations

Descriptive statistics

Machine Learning (Population stratification or Supervised learning)

… and share and replay!
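
As an illustration of the descriptive statistics mentioned above, here is a minimal sketch that computes an alternate-allele frequency per population. The data, the population names and the function name are all hypothetical, and plain Python stands in for what would be a distributed job on real data.

```python
from collections import defaultdict

# Hypothetical toy data: (sample, population, genotype), where the genotype
# counts copies of the alternate allele at one variant site (0, 1 or 2).
genotypes = [
    ("s1", "POP_A", 0),
    ("s2", "POP_A", 1),
    ("s3", "POP_A", 2),
    ("s4", "POP_B", 0),
    ("s5", "POP_B", 0),
    ("s6", "POP_B", 1),
]


def alt_allele_frequency(rows):
    """Return {population: alternate-allele frequency}, assuming diploid samples."""
    alt = defaultdict(int)    # alternate alleles seen per population
    total = defaultdict(int)  # total alleles per population (2 per sample)
    for _sample, population, alt_count in rows:
        alt[population] += alt_count
        total[population] += 2
    return {pop: alt[pop] / total[pop] for pop in total}


frequencies = alt_allele_frequency(genotypes)
print(frequencies)
```

The group-then-aggregate shape is the part that matters here; it is the kind of computation that partitions well across a cluster.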

The Spark Notebook … comes to the rescue.

+ Self described and consistent
+ Easily shared (code)

+ Scala (types, production quality)
+ Reactive & pluggable charts API (Scala = no JS)
+ Easy install, no deps.
+ Multiple SparkContexts

http://www.spark-notebook.io

The Spark Notebook

So what do we do with this?

… and share and replay!

Code can be shared easily but we want more...

How do we share data produced by the notebook?

How do we publish the notebook as a service?

Share Genomic At Scale
Spark, Tachyon, Mesos, Shar3

➔ Projects: Distributed teams
➔ Research: Long process
➔ Towards Maximum Share for efficiency

Projects

Intrinsically involving many teams

geographically distributed in different countries or laboratories

with different skills in Biology, Genetics, I.T., Medicine (legal, ...)

Projects

Require many types of data ranging from
bio samples
imagery
textual
archives/historical

Projects
Of course

Generally gather many people from several populations

Note: This is very expensive and burns $time as hell!

Projects
1,000 genomes (2008-2012): 200 TB

100,000 genomes (2013-2017): 20 PB (probably more)

1,000,000 genomes (2016-2020): 0.2 EB (probably more)

eQTL: mixing many sources
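
The project sizes quoted above follow from a simple linear extrapolation of the first figure. The sketch below reproduces that arithmetic; the assumption that storage grows linearly with the number of genomes is ours, and real projects differ in coverage and file formats.

```python
TB = 10**12
PB = 10**15
EB = 10**18

# Figure from the slide: about 200 TB for 1,000 genomes, i.e. 200 GB per genome.
per_genome_bytes = 200 * TB // 1_000


def project_size_bytes(n_genomes: int) -> int:
    """Linear extrapolation (an assumption) from the 1,000-genome figure."""
    return n_genomes * per_genome_bytes


print(project_size_bytes(100_000) / PB, "PB")    # size of a 100,000-genome project
print(project_size_bytes(1_000_000) / EB, "EB")  # size of a 1,000,000-genome project
```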

Projects
Need proper data management between entities, yet coping with:

amount of data
heterogeneity of people

distance between actors
constraints related to data location

Projects
Distributed friendly

SCHEMAS + BINARY

e.g. Avro
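
A minimal sketch of the "schema + binary" idea: when writer and reader agree on a schema, the payload carries only values, not field names. The record layout below is hypothetical and is not ADAM's actual Avro schema, and Python's `struct` module stands in for a real Avro encoder.

```python
import json
import struct

# Hypothetical schema, for illustration only. Each field has a name and a
# struct format code: "B" = unsigned 8-bit int, "I" = unsigned 32-bit int.
SCHEMA = [("chromosome", "B"), ("position", "I"), ("alt_allele_count", "B")]
FORMAT = "<" + "".join(code for _name, code in SCHEMA)  # little-endian, no padding


def encode(record: dict) -> bytes:
    """Pack the values in schema order; field names are not written."""
    return struct.pack(FORMAT, *(record[name] for name, _code in SCHEMA))


def decode(payload: bytes) -> dict:
    """Rebuild the record using the shared schema."""
    values = struct.unpack(FORMAT, payload)
    return {name: value for (name, _code), value in zip(SCHEMA, values)}


# Arbitrary example values.
record = {"chromosome": 7, "position": 1_234_567, "alt_allele_count": 1}
binary = encode(record)
text = json.dumps(record).encode("utf-8")

print(len(binary), "bytes as schema + binary")
print(len(text), "bytes as self-describing JSON")
```

The shared schema is also what lets teams with different tools read the same data without guessing at types.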

Research
Research in medicine or health in general is

LOOOOOOO…OOOOONG

Research
Most reasons are quite obvious and must not be overlooked

Lots of measures and validation
Lots of control (including by Gov.)

Lots of actors

Research
As a matter of fact, research needs

to be conducted on data and to produce results

And both are extremely exposed to reuse
So what if we lose either of them?

Research
However, we can get into trouble instantly

without even losing them!

What if we don’t track the processes?

In any scientific process: confrontation, replay and enhancement are keys to move forward

It is misleading to think that sharing the code is enough.

Remember: we are after data and results, not code.

The process includes the code, the context, the sources and so on, and all should be part of the data discovery/validation task
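
One way to picture tracking the process alongside the result is a small provenance record. This is only a sketch: the field names, the fingerprinting scheme and the example values are hypothetical, not something described in the talk.

```python
import hashlib
import json


def fingerprint(text: str) -> str:
    """Short, stable identifier for a piece of code or a source description."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


def provenance(code: str, sources: list, context: dict, result: dict) -> dict:
    """Bundle a result with what produced it (all field names are hypothetical)."""
    return {
        "code_id": fingerprint(code),
        "source_ids": sorted(fingerprint(s) for s in sources),
        "context": context,
        "result": result,
    }


entry = provenance(
    code="result = analyse(dataset)",
    sources=["cohort-A genotypes", "cohort-B genotypes"],
    context={"engine": "example", "run": 1},
    result={"n_samples": 6},
)
print(json.dumps(entry, indent=2, sort_keys=True))
```

With such a record, replaying an old process on new data changes the source identifiers while the code identifier stays the same, which makes the two runs comparable.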

Research

Assess the risk factor associated with a disease given mutations of a certain gene.

More than 50 years of data collecting and modelling.

Hundreds of researchers, each generation has new ideas.

Replaying old processes on new data, new processes on old data

Research

Share share share

All these facts relate to our capacity to share our work and to collaborate.

We need to share efficiently and accurately the
★ data
★ processes
★ results

Share share share

The challenge resides in the workflow

Share share share

“Create” Cluster

Find sources (context, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performance

Write results to Sinks

Access Layer

User Access

[Diagram labels: roles involved along the workflow: ops, data, sci, web]

Share share share

Streamlining development lifecycle for better Productivity with Shar3

Share share share

Analysis

Production

Distribution

Rendering

Discovery

Catalog

Project

Generator

Micro Service / Binary format

Schema for output

Metadata

That’s all folks
Thanks for listening/staying

Poke us on Twitter or via http://data-fellas.guru
@DataFellas @Shar3_Fellas @SparkNotebook
@Xtordoir & @Noootsab

Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas)

Check also @TypeSafe: http://t.co/o1Bt6dQtgH

https://docs.google.com/forms/d/151AKOlZTrmu3JZSPfbHpUSzk7mAaeE-SJk9hLsDUpYo/viewform
