Spark Summit Europe: Share and analyse genomic data at scale

Share and analyse genomic data at scale with Spark, Adam, Tachyon & the Spark Notebook by @DataFellas, Oct • 29th • 2015

Upload: andy-petrella

Post on 15-Feb-2017

TRANSCRIPT

Page 1: Spark Summit Europe: Share and analyse genomic data at scale

Share and analyse genomic data at scale
with Spark, Adam, Tachyon & the Spark Notebook
by @DataFellas, Oct • 29th • 2015

Page 2

Outline
● Sharp intro to Genomics data
● What are the Challenges
● Distributed Machine Learning to the rescue

● Projects: Distributed teams
● Research: Long process
● Towards Maximum Share for efficiency

Page 3

Andy Petrella

Maths, Geospatial, Distributed Computing

Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir

Physics, Bioinformatics, Distributed Computing

Scala (& Perl), trainer Spark, Machine Learning

“There must be another way of doing the credits” -- Robin Hood: Men in Tights (1993, Mel Brooks)

Page 4

Analyse Genomic At Scale
Spark, Adam, Spark Notebook

➔ Sharp intro to Genomics data
➔ What are the Challenges
➔ Distributed Machine Learning to the rescue

Page 5

What is genomics data?
DNA?

What makes us what we are…

… a complex biochemical soup.

With applications to medical diagnostics, drug response, disease mechanisms

Page 7

On the production side

Fast biotech progress…

… can IT keep up?

Page 9

On the production side

Sequence {A, T, G, C}

3 billion characters (bases)

… x 30 (x 60)

Massively parallel
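
The figures on this slide give a rough sense of raw volume per genome. A minimal sketch in Python, assuming "x 30 (x 60)" is the sequencing coverage (each base read about 30 or 60 times), and assuming one byte per base with no quality scores or compression. Both the function names and the byte-per-base figure are illustrative, not from the slides:

```python
# Rough volume of raw sequence data for a single genome.
GENOME_BASES = 3_000_000_000  # 3 billion bases, as stated on the slide


def sequenced_bases(coverage: int) -> int:
    """Total bases read when every position is covered `coverage` times."""
    return GENOME_BASES * coverage


def raw_gigabytes(coverage: int, bytes_per_base: int = 1) -> float:
    """Decimal gigabytes; bytes_per_base=1 is an illustrative assumption."""
    return sequenced_bases(coverage) * bytes_per_base / 10**9


for coverage in (30, 60):
    print(coverage, sequenced_bases(coverage), raw_gigabytes(coverage))
```

Real sequencer output also carries quality scores and read metadata, so actual files are larger than this lower bound.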

Page 10

Lots of data?

Page 11

Lots of data?

10s of millions

Page 12

Lots of data!

10s of millions

1,000s
1,000,000s...

Page 13

ADAM: Spark genomics library

http://www.bdgenomics.org

Matt Massie, Frank Nothaft

Page 17

ADAM: Spark genomics library

Avro schema

Parquet storage

Genomics API

Page 18

So what do we do with this?

Study variations between populations

Descriptive statistics

Machine Learning (Population stratification or Supervised learning)

… and share and replay!

Page 19

The Spark Notebook … comes to the rescue.

+ Self described and consistent
+ Easily shared (code)

+ Scala (types, production quality)
+ Reactive & pluggable charts API (scala = no.js)
+ easy install, no deps.
+ multiple sparkContext

http://www.spark-notebook.io

Page 23

So what do we do with this?

… and share and replay!

Code can be shared easily but we want more...

How do we share data produced by the notebook?

How do we publish the notebook as a service?

Page 24

Share Genomic At Scale
Spark, Tachyon, Mesos, Shar3

➔ Projects: Distributed teams
➔ Research: Long process
➔ Towards Maximum Share for efficiency

Page 25

Projects

Intrinsically involving many teams

geographically distributed in different countries or laboratories

with different skills in Biology, Genetics, I.T., Medicine (, legal...)

Page 26

Projects

Require many types of data, ranging from
bio samples
imagery
textual
archives/historical

Page 27

Projects
Of course

Generally gather many people from several populations

Note: This is very expensive and burns $time as hell!

Page 28

Projects
1,000 genomes (2008-2012): 200 TB

100,000 genomes (2013-2017): 20 PB (probably more)

1,000,000 genomes (2016-2020): 0.2 EB (probably more)

eQTL: mixing many sources
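
The three project sizes above scale linearly with the number of genomes. A minimal sketch in Python that checks the arithmetic, taking the sizes as decimal terabytes, petabytes and exabytes (an assumption about the units; 0.2 EB is written as 200 PB to stay in integers). The dictionary and function names are illustrative:

```python
# Per-genome storage implied by the project totals quoted on the slide.
TB, PB = 10**12, 10**15

PROJECTS = {
    "1,000 genomes": (1_000, 200 * TB),
    "100,000 genomes": (100_000, 20 * PB),
    "1,000,000 genomes": (1_000_000, 200 * PB),  # 0.2 EB
}


def bytes_per_genome(genomes: int, total_bytes: int) -> int:
    """Average storage per genome, in bytes."""
    return total_bytes // genomes


for name, (genomes, total_bytes) in PROJECTS.items():
    per_genome_gb = bytes_per_genome(genomes, total_bytes) // 10**9
    print(f"{name}: {per_genome_gb} GB per genome")
```

Under these assumptions every project works out to the same 200 GB per genome, so the totals grow only because the cohorts grow.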

Page 29

Projects
Need proper data management between entities, yet coping with:

amount of data
heterogeneity of people

distance between actors
constraints related to data location

Page 30

Projects
Distributed friendly

SCHEMAS + BINARY

e.g. Avro
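
The "schemas + binary" idea can be sketched with Python's standard `struct` module: a shared schema fixes the field order and types, so records travel as compact bytes that any team can decode. This is only an illustration of the principle. It is not Avro's wire format nor ADAM's schema, and the field names below are hypothetical:

```python
import struct

# Hypothetical schema, for illustration only: (field name, struct format code).
VARIANT_SCHEMA = [
    ("position", "I"),  # unsigned 32-bit integer
    ("ref_base", "c"),  # single byte, e.g. b"A"
    ("alt_base", "c"),  # single byte, e.g. b"G"
]
# "<" selects little-endian with no padding, so the layout is machine independent.
RECORD_FORMAT = "<" + "".join(code for _, code in VARIANT_SCHEMA)


def encode(record: dict) -> bytes:
    """Serialise a record to bytes, in the field order fixed by the schema."""
    return struct.pack(RECORD_FORMAT, *(record[name] for name, _ in VARIANT_SCHEMA))


def decode(payload: bytes) -> dict:
    """Rebuild a record from bytes using the same schema."""
    values = struct.unpack(RECORD_FORMAT, payload)
    return {name: value for (name, _), value in zip(VARIANT_SCHEMA, values)}


record = {"position": 123456, "ref_base": b"A", "alt_base": b"G"}
payload = encode(record)
print(len(payload), decode(payload) == record)
```

Because field names live in the schema rather than in every record, the payload stays small, which matters when the data must move between distant teams.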

Page 31

Research
Research in medicine or health in general is

LOOOOOOO…OOOOONG

Page 32

Research
Most reasons are quite obvious and must not be overlooked

Lots of measures and validation
Lots of control (including by Gov.)

Lots of actors

Page 33

Research
As a matter of fact, research needs

to be conducted on data and to produce results

And both are extremely exposed to reuse
So what if we lose either of them?

Page 34

Research
However, we can get into trouble instantly

without even losing them!

What if we don’t track the processes?

In any scientific process: confrontation, replay and enhancement are keys to move forward

Page 35

Research

It is misleading to think that sharing the code is enough.

Remember: we look for data and results, not for code.

The process includes the code, the context, the sources and so on, and all should be part of the data discovery/validation task

Page 36

Research

Assess the risk factor associated with a disease given mutations of a certain gene.

More than 50 years of data collecting and modelling.

Hundreds of researchers, each generation has new ideas.

Replaying old processes on new data, new processes on old data

Page 37

Share share share

All these facts relate to our capacity to share our work and to collaborate.

We need to share efficiently and accurately the
★ data
★ processes
★ results

Page 38

Share share share

The challenge resides in the workflow

Page 39

Share share share

Workflow steps (from the slide diagram):
“Create” Cluster
Find sources (context, quality, semantic, …)
Connect to sources (structure, schema/types, …)
Create distributed data pipeline/Model
Tune accuracy
Tune performances
Write results to Sinks
Access Layer
User Access

Role labels attached to the steps (in slide order):
ops
data
ops data
sci
sci ops
sci
ops data
web ops data
web ops data sci

Page 41

Share share share

Streamlining development lifecycle for better Productivity with Shar3

Page 42

Share share share

Analysis

Production

Distribution
Rendering

Discovery

Catalog
Project

Generator

Micro Service / Binary format

Schema for output

Metadata

Page 43

That’s all folks
Thanks for listening/staying

Poke us on Twitter or via http://data-fellas.guru
@DataFellas @Shar3_Fellas @SparkNotebook
@Xtordoir & @Noootsab

Building Distributed Pipelines for Data Science using Kafka, Spark, and Cassandra (form → @DataFellas)

Check also @TypeSafe: http://t.co/o1Bt6dQtgH