Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines Cindy Perscheid Festival of Genomics London, Jan 19, 2016


Page 1: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Cindy Perscheid

Festival of Genomics

London, Jan 19, 2016

Page 2: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Perscheid, Schapranow

Recap: Analyze Genomes - Real-time Analysis of Big Medical Data

In-Memory Database

Extensions for Life Sciences

Data Exchange, App Store

Access Control, Data Protection

Fair Use

Statistical Tools

Real-time Analysis

App-spanning User Profiles

Combined and Linked Data

Genome Data

Cellular Pathways

Genome Metadata

Research Publications

Pipeline and Analysis Models

Drugs and Interactions


Drug Response Analysis

Pathway Topology Analysis

Medical Knowledge Cockpit Oncolyzer

Clinical Trial Recruitment

Cohort Analysis

...

Indexed Sources

Page 3: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

From Raw Genome Data to Analysis Results


■  Sequencing: Acquire digital DNA data

■  Alignment: Reconstruction of complete genome with snippets

■  Variant Calling: Identification of genetic variants

■  Data Annotation: Linking genetic variants with research findings

Page 4: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Genome Data Processing Pipelines: State of the Art

■  Not standardized

■  Not exchangeable

■  Concatenation of bash scripts reading from and writing to files

■  Requires IT expertise for

□  Setup

□  Error handling, and

□  Efficient processing and parallelization

■  Objective: Model, configure, and execute pipelines without involving IT experts

Example of a typical concatenation:

bwa aln ref.fa sample.fastq | bwa samse ref.fa - sample.fastq | samtools view -Su - | samtools sort …

Page 5: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Agenda

■  Modeling Genome Data Processing Pipelines

■  Pipeline Execution in the Worker Framework


Page 7: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Modeling Genome Data Processing Pipelines: BPMN 2.0 Example

[Figure: example pipeline modeled in BPMN 2.0. Labeled elements: start event, end event, annotated data object, parallel gateway, collapsed subtask, task, and multiple task instances executed in parallel.]

Page 8: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Modeling Genome Data Processing Pipelines: BPMN 2.0 Extensions

■  Graphical modeling notation

■  Compliant with BPMN 2.0, extended by

□  Modular structure

□  Degree of parallelization

□  Parameters and variables

■  Model descriptions (XPDL) are stored in the in-memory database (IMDB)

■  Model instances are transformed into a graph structure that is executed by our worker framework
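The transformation of a model instance into an executable graph can be illustrated with a minimal sketch. It assumes a simple in-memory representation; the names Activity, PipelineModel, execution_order, and expand, as well as the example activities, are hypothetical and do not reflect the XPDL schema or the worker framework used by Analyze Genomes. It requires Python 3.9 or later for graphlib.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Activity:
    """One modeled task; `parallel` is its degree of parallelization."""
    name: str
    parallel: int = 1
    params: dict = field(default_factory=dict)


@dataclass
class PipelineModel:
    """Hypothetical graph form of a pipeline model."""
    activities: dict = field(default_factory=dict)    # name -> Activity
    predecessors: dict = field(default_factory=dict)  # name -> set of names

    def add(self, activity, after=()):
        self.activities[activity.name] = activity
        self.predecessors[activity.name] = set(after)

    def execution_order(self):
        """Return activity names so that predecessors always come first."""
        return list(TopologicalSorter(self.predecessors).static_order())

    def expand(self):
        """Turn each activity into `parallel` concrete subtask labels."""
        return [
            f"{name}[{i}]"
            for name in self.execution_order()
            for i in range(self.activities[name].parallel)
        ]


# Illustrative pipeline: the activity names are assumptions, not from the talk.
model = PipelineModel()
model.add(Activity("split", params={"input": "$INPUT"}))
model.add(Activity("align", parallel=3), after=["split"])
model.add(Activity("merge"), after=["align"])
model.add(Activity("call_variants"), after=["merge"])
print(model.expand())
```

In this sketch, the degree of parallelization only multiplies the subtask labels; a real executor would also have to partition the input among the parallel instances.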

Page 9: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Modeling Genome Data Processing Pipelines: From Model to Execution

1.  Design time (researcher, process expert)

□  Definition of parameterized process model

□  Uses graphical editor and available jobs

2.  Configuration time (researcher, lab assistant)

□  Select model and specify parameters

□  Results in model instance stored in database

3.  Execution time (researcher)

□  Select model instance

□  Specify execution parameters, e.g. input files
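The three phases can be sketched as successive parameter binding. This is an illustration under stated assumptions, not the Analyze Genomes implementation: ProcessModel, ModelInstance, configure, execute, and the command-line tool "align" with its flags are all hypothetical.

```python
from dataclasses import dataclass
from string import Template


@dataclass(frozen=True)
class ProcessModel:
    """Design time: a pipeline whose steps contain $placeholders."""
    name: str
    steps: tuple


@dataclass(frozen=True)
class ModelInstance:
    """Configuration time: a model plus fixed parameter values."""
    model: ProcessModel
    params: dict


def configure(model, **params):
    """Create a model instance; in the talk this is stored in the database."""
    return ModelInstance(model, dict(params))


def execute(instance, **execution_params):
    """Execution time: bind the remaining values and resolve every step."""
    values = {**instance.params, **execution_params}
    # Template.substitute raises KeyError if a placeholder is left unbound.
    return [Template(step).substitute(values) for step in instance.model.steps]


# "align" and its flags are made up for this example.
model = ProcessModel(
    name="alignment",
    steps=("align --ref $reference --threads $threads $input_file",),
)
instance = configure(model, reference="ref.fa", threads=8)
commands = execute(instance, input_file="sample.fastq")
print(commands)
```

Separating configuration from execution means the same model instance can be reused for many input files without repeating the parameter choices.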

Page 10: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Modeling Genome Data Processing Pipelines: Traditional vs. Optimized Approach

■  Results are imported into IMDB

■  Optimization reduced execution time by more than 50%

Page 11: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Agenda

■  Modeling of Genome Data Processing Pipelines

■  Worker Framework

Page 12: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Pipeline Execution: Overview

Components: Webservice, Scheduler, Workers, In-Memory Database (tables Tasks and Subtasks)

1. Trigger task execution

2. Schedule subtasks

3. Execute subtasks

Tasks

ID   Pipeline   Params
12   BWA        xyz.fastq
13   BWAmem     abc.fastq
14   Bowtie2    xyz.fastq

Subtasks

Task   ID   Job      Status   Params
12     97   Split    done     xyz.fastq
12     98   Import   todo     abc.vcf
12     98   Import   done     abc.vcf
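The three steps can be sketched against two tables shaped like the ones on the slide. This is a minimal illustration: sqlite3 stands in for the in-memory database, and the function names and the exact schema are assumptions, not the actual Analyze Genomes code.

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stand-in for the in-memory database
db.executescript("""
    CREATE TABLE tasks (id INTEGER PRIMARY KEY, pipeline TEXT, params TEXT);
    CREATE TABLE subtasks (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        task INTEGER REFERENCES tasks(id),
        job TEXT, status TEXT, params TEXT
    );
""")


def trigger_task(task_id, pipeline, params):
    """Step 1: the web service records a new task."""
    db.execute("INSERT INTO tasks VALUES (?, ?, ?)", (task_id, pipeline, params))


def schedule_subtasks(task_id, jobs):
    """Step 2: the scheduler breaks the task into subtasks marked 'todo'."""
    db.executemany(
        "INSERT INTO subtasks (task, job, status, params) VALUES (?, ?, 'todo', ?)",
        [(task_id, job, params) for job, params in jobs],
    )


def execute_next_subtask():
    """Step 3: a worker picks the oldest 'todo' subtask and marks it 'done'."""
    row = db.execute(
        "SELECT id, job FROM subtasks WHERE status = 'todo' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    subtask_id, job = row
    # A real worker would run the job here before updating the status.
    db.execute("UPDATE subtasks SET status = 'done' WHERE id = ?", (subtask_id,))
    return job


trigger_task(12, "BWA", "xyz.fastq")
schedule_subtasks(12, [("Split", "xyz.fastq"), ("Import", "abc.vcf")])
executed = [execute_next_subtask(), execute_next_subtask(), execute_next_subtask()]
print(executed)
```

Because every status change is written to the database, the scheduler and the workers in this sketch need no other shared state.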

Page 13: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Pipeline Execution: Software Components and Communication

[Figure: nodes at Site I and Site II, each running several workers and an IMDB instance; the sites are connected via VPN. Further labeled elements: Scheduler, Transmitter, UDP and TCP communication, and a shared file system at each site.]

Page 14: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Pipeline Execution: Runtime Layer - Worker

■  Workers execute jobs one by one

■  Subtask execution status in IMDB:

□  Ready (0),

□  In Progress (1),

□  Done (2), or

□  Erroneous (3).

■  Jobs implemented as Python modules/classes

□  Can contain arbitrary code

□  Have access to IMDB

□  Can read/write to shared working directory

[Figure: a node running three workers and an IMDB instance.]
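A worker loop with these four status codes could look roughly as follows. The status values are taken from the slide; the Job base class, the two example jobs, and the dictionary-based queue (standing in for the subtask table in the IMDB) are assumptions made for illustration.

```python
from enum import IntEnum


class Status(IntEnum):
    """Subtask status codes as listed on the slide."""
    READY = 0
    IN_PROGRESS = 1
    DONE = 2
    ERRONEOUS = 3


class Job:
    """Hypothetical base class; concrete jobs override run() with arbitrary code."""
    def run(self, params):
        raise NotImplementedError


class CountLines(Job):
    def run(self, params):
        return len(params["text"].splitlines())


class AlwaysFails(Job):
    def run(self, params):
        raise RuntimeError("simulated failure")


def work(subtasks):
    """Execute subtasks one by one, recording each status transition."""
    for subtask in subtasks:
        if subtask["status"] != Status.READY:
            continue
        subtask["status"] = Status.IN_PROGRESS
        try:
            subtask["result"] = subtask["job"].run(subtask["params"])
            subtask["status"] = Status.DONE
        except Exception as error:
            # A failing job must not stop the worker; record the error instead.
            subtask["error"] = str(error)
            subtask["status"] = Status.ERRONEOUS


# The list of dicts stands in for the subtask table held in the IMDB.
queue = [
    {"job": CountLines(), "params": {"text": "ACGT\nTTGA\n"}, "status": Status.READY},
    {"job": AlwaysFails(), "params": {}, "status": Status.READY},
]
work(queue)
print([int(s["status"]) for s in queue])
```

Catching the exception inside the loop keeps one erroneous job from blocking the subtasks queued behind it.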

Page 15: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Pipeline Execution: Coordination Layer - Scheduler

[Figure: scheduler components. Labeled elements: Scheduling Algorithm, Pipeline Executor, Resource Allocator, and subtasks i, k, ..., m.]

Page 16: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Pipeline Execution: Scheduling Algorithms

■  Scheduling algorithms are plug-in software modules

□  “User-/Group-based” to let users execute their tasks on their local site only

□  “Priority First” to prefer important users

□  “High Throughput”, i.e. “shortest task first” to deal with high load

■  Scheduling algorithms can also be composed hierarchically

[Figure: example of a hierarchical composition combining group-based (Site I, Site II), priority-based (Prio A, Prio *), and high-throughput algorithms.]
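Pluggable and composable scheduling algorithms can be sketched as plain functions that order a list of pending subtasks. Everything here is an assumption for illustration: the Subtask fields, the function names, and the way chain and group_based combine algorithms are not taken from the Analyze Genomes scheduler.

```python
from dataclasses import dataclass


@dataclass
class Subtask:
    name: str
    site: str
    priority: int        # higher means more important (assumption)
    estimated_secs: int  # assumed runtime estimate


def priority_first(subtasks):
    """Prefer subtasks with the highest priority."""
    return sorted(subtasks, key=lambda s: -s.priority)


def high_throughput(subtasks):
    """Shortest task first."""
    return sorted(subtasks, key=lambda s: s.estimated_secs)


def group_based(site, inner):
    """Restrict candidates to one site, then delegate ordering to `inner`."""
    def algorithm(subtasks):
        return inner([s for s in subtasks if s.site == site])
    return algorithm


def chain(*algorithms):
    """Compose algorithms: later ones break ties left by earlier ones."""
    def algorithm(subtasks):
        ordered = list(subtasks)
        # sorted() is stable, so apply the least significant ordering first.
        for alg in reversed(algorithms):
            ordered = alg(ordered)
        return ordered
    return algorithm


pending = [
    Subtask("a", "site1", priority=1, estimated_secs=30),
    Subtask("b", "site1", priority=5, estimated_secs=90),
    Subtask("c", "site2", priority=5, estimated_secs=10),
    Subtask("d", "site1", priority=5, estimated_secs=20),
]
site1 = group_based("site1", chain(priority_first, high_throughput))
order = [s.name for s in site1(pending)]
print(order)
```

Because each algorithm takes and returns a list of subtasks, a composed algorithm is interchangeable with a basic one, which is what makes the hierarchy possible in this sketch.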

Page 17: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Pipeline Execution: Resource Allocator

■  Maintains lists of running and idle nodes

■  An idle worker requests a new subtask for its assigned groups

■  If there is no matching subtask, it sleeps until a new subtask becomes ready

[Figure: nodes at Site 1 and Site 2, each with three workers and an IMDB instance; node states shown: working, working, waiting.]
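The request-or-sleep behavior can be sketched with a condition variable. This is a single-process illustration only: the class and method names are hypothetical, the bookkeeping of running and idle nodes is omitted, and the real framework coordinates workers across nodes rather than threads.

```python
import threading
from collections import deque


class ResourceAllocator:
    """Hypothetical allocator handing ready subtasks to idle workers."""

    def __init__(self):
        self._ready = deque()                  # (group, subtask) pairs
        self._changed = threading.Condition()

    def publish(self, group, subtask):
        """Mark a subtask as ready and wake up sleeping workers."""
        with self._changed:
            self._ready.append((group, subtask))
            self._changed.notify_all()

    def request(self, groups, timeout=None):
        """Return a ready subtask for one of `groups`, sleeping until one exists."""
        def take():
            for item in self._ready:
                if item[0] in groups:
                    self._ready.remove(item)
                    return item[1]
            return None

        with self._changed:
            found = take()
            while found is None:
                if not self._changed.wait(timeout):
                    return None                # timed out without a match
                found = take()
            return found


allocator = ResourceAllocator()
results = []
worker = threading.Thread(
    target=lambda: results.append(allocator.request({"site1"}, timeout=5))
)
worker.start()
allocator.publish("site2", "import abc.vcf")   # other group: not handed to this worker
allocator.publish("site1", "split xyz.fastq")  # matching group: worker receives it
worker.join()
print(results)
```

Sleeping on a condition instead of polling means an idle worker consumes no resources until a matching subtask is published.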

Page 18: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Pipeline Execution: Flexibility

[Figure: a scheduler and three nodes, one labeled with an IMDB instance only, one with workers only, and one with both; further labels: UDP and TCP communication and a file system share.]

Page 19: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Pipeline Execution: Recoverability

■  All execution data is stored in IMDB

■  Temporary files reside on a shared file system

■  In case of any failure, the system-wide state can be restored

[Figure: IMDB, scheduler, and workers. Labeled elements: pipeline tasks, pipeline subtasks, events, and data.]
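One way such a restore could work is to reset subtasks that were interrupted mid-run so they get scheduled again. The talk does not describe the actual recovery procedure, so this is only a sketch under assumptions: sqlite3 stands in for the IMDB, and the recover function, the table layout, and the example job names are hypothetical. The status codes follow the worker slide.

```python
import sqlite3

READY, IN_PROGRESS, DONE = 0, 1, 2  # status codes from the worker slide


def recover(db):
    """Reset subtasks that were interrupted mid-run so they can be rescheduled."""
    cursor = db.execute(
        "UPDATE subtasks SET status = ? WHERE status = ?", (READY, IN_PROGRESS)
    )
    db.commit()
    return cursor.rowcount


db = sqlite3.connect(":memory:")  # stand-in for the IMDB
db.execute("CREATE TABLE subtasks (id INTEGER PRIMARY KEY, job TEXT, status INTEGER)")
db.executemany(
    "INSERT INTO subtasks VALUES (?, ?, ?)",
    [(97, "Split", DONE), (98, "Align", IN_PROGRESS), (99, "Import", READY)],
)
reset_count = recover(db)
statuses = db.execute("SELECT id, status FROM subtasks ORDER BY id").fetchall()
print(reset_count, statuses)
```

Finished subtasks keep their status, so only the interrupted work is repeated after a failure.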

Page 20: Festival of Genomics 2016 London: Analyze Genomes: Modeling and Executing Genome Data Processing Pipelines

Thanks!

Hasso Plattner Institute Enterprise Platform & Integration Concepts

August-Bebel-Str. 88 14482 Potsdam, Germany

Dr. Matthieu-P. Schapranow

http://we.analyzegenomes.com/

Cindy Perscheid, M. Sc.
