Using explicit control processes in distributed workflows to gather provenance Sergio M. S. Cruz Fernando Seabra Chirigati Rafael Dahis Maria Luiza M. Campos Marta Mattoso Federal University of Rio de Janeiro, Brazil UFRJ


Post on 30-Dec-2015


Using explicit control processes in distributed workflows to gather provenance

Sergio M. S. Cruz, Fernando Seabra Chirigati, Rafael Dahis, Maria Luiza M. Campos, Marta Mattoso

Federal University of Rio de Janeiro, Brazil

UFRJ

Agenda

• Introduction

• Motivation: Control flow in data-centric workflows

• Objective: Provenance gathering in distributed workflows with explicit control flows

• Use case: Control flow on VisTrails

• Conclusion

Distribution & Heterogeneity in Workflows

• Scientific workflows (Wf) enable data-intensive analyses

  - Use of grid vs. remote parallel machines

  - Use of different WfMS

    - Different provenance capture mechanisms

  - Use of centralized vs. distributed WfMS

    - Often offer disjoint sets of capabilities

How to obtain a homogeneous provenance representation and capture mechanism?

Control flow matters in data-centric workflows

• Scientific workflows also need control structures to specify how the data flow should be directed

• Goderis et al. [6] stress the importance of combining different models of computation in one scientific workflow

• Bowers et al. [5] say that "modeling control-flow using only dataflow constructs can quickly lead to overly complex workflows that are hard to understand, reuse, reconfigure, maintain, and schedule"

• Tudruj et al. [7] state the importance of general dynamic control flow, but focus on synchronization of parallel execution

  - Presented a set of generic control structures and proposed the use of a monitoring middleware

A real example: OrthoSearch workflow

Detect distant homologies in five parasites associated with neglected tropical diseases

[Figure: OrthoSearch workflow. Labels: FormatDB, BLAST, MAFFT/HMMER packages, Best Hits Finder, InterPRO]

OrthoSearch specification in Kepler

• Some lightweight tasks can run locally

• Suppose we need to execute MAFFT/HMMER in a high-performance environment

• Just send it to a grid!

[Figure: OrthoSearch workflow with the time-consuming tasks marked. Labels: BLAST, MAFFT/HMMER packages, Best Hits Finder, FormatDB, InterPRO]

OrthoSearch - loops, choice, …

How to map this to the grid language?

[Figure: OrthoSearch workflow. Labels: LOCAL, BLAST, MAFFT/HMMER packages, Best Hits Finder, FormatDB, InterPRO]

OrthoSearch - loops, choice, …

Alternatively, send one job at a time to execute remotely

Can be very inefficient!

OrthoSearch - loops, choice, …

Rewrite this in the grid language, e.g. Triana, which supports loops!

But how to bring provenance data back to Kepler?

How to register loop iterations?

OrthoSearch - loops, choice: other issues

What if my available grid does not have a WfMS?

What if my available grid supports another WfMS?

What if the grid WfMS does not support loops?

Generic control flow modules with remote provenance gathering!

Motivation

• Workflow design

  - Different WfMS present their own control structures, parallel execution models, etc.

    - Expose different modeling semantics to the users!

• Provenance gathering

  - WfMS register provenance in their own schema

  - Often encompassing specific grid features

  - Based on application domain attributes

A lot of mappings and conversions!

Many challenges in changing WfMS for the same workflow

Objective

• Diminish the dependence of the workflow definition on the WfMS

  - Uncoupling the provenance gathering system from the WfMS

  - Having some control flow of execution independent of the WfMS workflow specification language

• Plugging control flow and provenance gathering modules alongside the original workflow tasks

  - The workflow specification can be executed almost independently of the current WfMS

  - Provenance can be gathered uniformly
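
As an illustration of gathering provenance uniformly, outside any particular WfMS, the following is a minimal Python sketch. It is not the authors' implementation: `ProvenanceLog`, `ProvenanceRecord` and the stand-in tasks are hypothetical names, and the lambdas only mimic FormatDB and BLAST by building strings.

```python
import time
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class ProvenanceRecord:
    """One uniform record per task execution (hypothetical schema)."""
    task: str
    inputs: tuple
    output: Any
    started: float
    finished: float


@dataclass
class ProvenanceLog:
    records: List[ProvenanceRecord] = field(default_factory=list)

    def wrap(self, name: str, task: Callable[..., Any]) -> Callable[..., Any]:
        """Return a callable that runs `task` and records what it consumed and produced."""
        def run(*inputs: Any) -> Any:
            started = time.time()
            output = task(*inputs)
            self.records.append(
                ProvenanceRecord(name, inputs, output, started, time.time())
            )
            return output
        return run


log = ProvenanceLog()
# Stand-in tasks (assumptions): real tasks would invoke the actual tools.
format_db = log.wrap("formatdb", lambda fasta: fasta + ".db")
blast = log.wrap("blast", lambda db, query: f"hits({query} vs {db})")

result = blast(format_db("cogs.fasta"), "parasite.fasta")
```

Because the wrapper, not the WfMS, writes the records, the same schema would be produced wherever the wrapped tasks run.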

Scientific Workflow Control Flows

• A small set of generic workflow-level control modules

• Based on workflow patterns by Van der Aalst et al.

Workflow Pattern                              Module
Structured Discriminator                      Mux
Exclusive Choice                              Demux
Deferred Choice                               String Control
Multiple Instances without Synchronization    Number Control
Synchronization                               Number Compare
Exclusive Choice                              If

Scientific Workflow Control Flows

[Figure: OrthoSearch workflow with implicit control flow. Labels: COGsDB, PtnDB, fastacmd, formatdb, BLAST, MAFFT, HMMER (hmmbuild, hmmcalibrate, hmmsearch, hmmpfam), Reciprocal Best Hits Finder, InterPRO, Reannotated genes, Implicit DECISION, Implicit LOOP]

Scientific Workflow with Explicit Control Flows

[Figure: workflow with explicit control flow. Labels: Explicit LOOP (MAFFT, hmmbuild, hmmcalibrate, hmmsearch, hmmpfam, HMMER), Initial condition, MUX, IF (T/F branches), Explicit DECISION]

• All these modules can be sent to execute in any HPC environment

• Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules

Meta-workflow eases migration of a Wf from one WfMS to another!
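
The explicit loop on this slide can be sketched in Python as a Mux fed by an initial condition plus an If with T/F branches, with provenance recorded once per iteration. This is a hedged illustration only: `mux`, `if_module`, `explicit_loop` and the record fields are hypothetical names, and the doubling body is a stand-in for the HMMER steps.

```python
from typing import Any, Callable, Dict, List

_NO_FEEDBACK = object()  # marks that the loop has not produced a value yet


def mux(initial: Any, feedback: Any) -> Any:
    """Merge two inputs into one branch: the initial condition first, the feedback afterwards."""
    return initial if feedback is _NO_FEEDBACK else feedback


def if_module(condition: Callable[[Any], bool], value: Any) -> str:
    """Evaluate a user-supplied logical expression and name the enabled branch."""
    return "T" if condition(value) else "F"


def explicit_loop(
    initial: Any,
    body: Callable[[Any], Any],
    repeat_while: Callable[[Any], bool],
    provenance: List[Dict[str, Any]],
) -> Any:
    """Run `body` repeatedly; the T branch feeds the result back into the Mux."""
    feedback: Any = _NO_FEEDBACK
    iteration = 0
    while True:
        iteration += 1
        current = mux(initial, feedback)
        result = body(current)
        branch = if_module(repeat_while, result)
        # The control modules are where each loop iteration gets registered.
        provenance.append(
            {"iteration": iteration, "input": current, "output": result, "branch": branch}
        )
        if branch == "F":
            return result
        feedback = result


provenance: List[Dict[str, Any]] = []
final = explicit_loop(1, lambda x: x * 2, lambda x: x < 10, provenance)
```

Since the loop lives in ordinary modules rather than in a WfMS-specific construct, each iteration leaves its own provenance record.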

Control flow modules on VisTrails

• All these control flow modules were made available on VisTrails

• More explicit control is now available

• Remote execution can keep specified control

• Remote execution can bring provenance data back to VisTrails with a compatible structure

Advantages

OrthoSearch on VisTrails

• All these inner modules (sub-workflow) can be sent to execute in a grid or HPC environment

• Provenance gathering mechanisms can be inserted in the control flow modules or other specific modules

• In VisTrails the loop could not be implemented, because VisTrails is a DAG-based WfMS

[Figure labels: External LOOP (parameter exploration), Explicit DECISION]

Scientific Workflow - Heterogeneity

[Figure: OrthoSearch workflow with the time-consuming tasks marked. Labels: COGsDB, PtnDB, fastacmd, formatdb, BLAST, MAFFT, HMMER (hmmbuild, hmmcalibrate, hmmsearch, hmmpfam), Reciprocal Best Hits Finder, InterPRO, Reannotated genes, Time consuming]

OrthoSearch on VisTrails

• BLAST modules should be sent to execute on a PC cluster

• Provenance gathering mechanisms can be inserted in the control flow modules to be sent to the parallel environment

• In VisTrails this can be achieved using the MidMon modules

[Figure labels: REMOTE PARALLEL EXECUTION, BLAST]

MidMon on VisTrails

Monitoring tool that checks scientific processes running on distributed environments

• Message exchange-based tool

• Decoupled, with a modular infrastructure

• Support for legacy applications on distributed resources

[Figure labels: Implementation, Data Modules, Control Modules, BLAST]

Concluding

• We share the same motivation as Bowers et al., Goderis et al. and Tudruj et al.

• And the same as Groth et al.

• We propose:

  - A set of generic control-flow structures independent of WfMS

• Our implementation has shown that:

  - Control-flow structures can allow generic sub-workflow remote execution

  - Remote process provenance can be captured in the same representation as the Wf

  - Workflow refactoring is facilitated

  - Control-flow structures can be coupled to monitoring middleware

Using explicit control flow

Provenance independent of a WfMS

Conclusion

• Distribution & Heterogeneity are inevitable in scientific workflows

• Adding control-flow modules to the scientific workflow specification can help execution by heterogeneous WfMS running on distributed environments

  - Acts as documentation of the workflow's execution control

  - Allows the activities of the workflow to be evaluated and monitored

  - Helps to gather provenance from heterogeneous and independent environments with low programming effort

• MidMon on top of VisTrails

  - Enables scientists to monitor the status of submitted jobs on their desktops

  - Preserves the workflows' original features

Future work

• Use workflow views, e.g. ZOOM*

  - Our solution makes the workflow very verbose

• Use software component reuse and refactoring techniques to help the automatic incorporation of these modules

  - "Using Provenance to Improve Workflow Design", Tosta et al.

• Work with other workflows from bioinformatics and oil industry

Using explicit control processes in distributed workflows to gather provenance

Sergio M. S. da Cruz, Fernando Seabra Chirigati, Rafael Dahis, Maria Luiza M. Campos, Marta Mattoso

Federal University of Rio de Janeiro, Brazil

Thanks !

Scientific Workflow Control Flows


MUX: Describes a convergence between two or more input ports, resulting in just one branch.


DEMUX: Represents an incoming branch that diverges into two or more branches. Only one of the outgoing branches is enabled, depending on an associated condition.
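
A minimal Python sketch of this routing behaviour, under stated assumptions: `demux`, `DISABLED` and the length-based condition are hypothetical names chosen for illustration, not part of the VisTrails modules described in the slides.

```python
from typing import Any, Callable, List

DISABLED = object()  # marker for an outgoing branch that is not enabled


def demux(value: Any, selector: Callable[[Any], int], n_branches: int) -> List[Any]:
    """Route `value` to exactly one of `n_branches` outgoing branches."""
    index = selector(value)
    if not 0 <= index < n_branches:
        raise IndexError(f"selector chose branch {index}, expected 0..{n_branches - 1}")
    branches = [DISABLED] * n_branches
    branches[index] = value
    return branches


# Hypothetical condition: short sequences go to branch 0, longer ones to branch 1.
def by_length(seq: str) -> int:
    return 0 if len(seq) < 5 else 1


short_out, long_out = demux("ACGT", by_length, 2)
```

Downstream modules would run only on the branch that does not carry the `DISABLED` marker.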


STRING CONTROL: The workflow is divided into two or more branches, and just one of them can be enabled; the other outgoing branches are withdrawn.


NUMBER CONTROL: All output data are produced simultaneously.


NUMBER COMPARE: Two or more incoming branches become one outgoing branch, which is enabled only after all of the input data have arrived.
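
The synchronization behaviour can be sketched in Python as a join that stays silent until every expected input port has delivered a value. `SyncJoin` and the port names are hypothetical; this illustrates the pattern, not the module's actual code.

```python
from typing import Any, Dict, Iterable, Optional


class SyncJoin:
    """Enable the single outgoing branch only once every input port has a value."""

    def __init__(self, port_names: Iterable[str]) -> None:
        self._expected = frozenset(port_names)
        self._values: Dict[str, Any] = {}

    def receive(self, port: str, value: Any) -> Optional[Dict[str, Any]]:
        """Store a value; return all inputs once complete, otherwise None."""
        if port not in self._expected:
            raise KeyError(f"unknown input port: {port}")
        self._values[port] = value
        if self._expected.issubset(self._values):
            return dict(self._values)
        return None


join = SyncJoin(["hmmsearch", "hmmpfam"])
first = join.receive("hmmsearch", "search.out")  # still waiting for hmmpfam
second = join.receive("hmmpfam", "pfam.out")     # all inputs present
```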


IF: Same pattern as the Demux, but with two differences: If has only two input ports, and it has a logical expression in which scientists can write any condition they need.

MidMon

• Offers a generic and lightweight monitoring tool that checks scientific processes running on distributed environments

  - Message exchange-based, two-layered modular infrastructure

  - Decoupled and lightweight, crossing different network boundaries

  - Easy to deploy and manage

  - Support for legacy applications on distributed resources

MidMon Monitoring Data

Data that it may be possible to monitor:

• Task state data

• State of the environment

• Service availability

MidMon – State Data

• Task state data that it may be possible to monitor:

  - Progress of a service: relies on checkpoints within the service, or the service may be able to provide an estimate of its progress

  - Completion of a service: this could be a simple event indicating that a service has produced all of its output files

  - Data consumption rate of a service: a measure of the rate at which the service is consuming data from an input file

  - Data production rate of a service: a measure of the rate at which the service is generating data for an output file
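
As a rough illustration of the rate measures listed above, the Python sketch below derives a consumption or production rate from two byte-count samples. The `Sample` type and `data_rate` function are assumptions for illustration; the slides do not specify how MidMon computes these values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """A byte count observed at a point in time (hypothetical message payload)."""
    timestamp: float  # seconds
    bytes_done: int   # bytes read from the input file, or written to the output file


def data_rate(previous: Sample, current: Sample) -> float:
    """Bytes per second between two samples of the same file."""
    elapsed = current.timestamp - previous.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (current.bytes_done - previous.bytes_done) / elapsed


consumption = data_rate(Sample(0.0, 0), Sample(2.0, 1000))
```

The same function serves both measures: feed it input-file offsets for consumption, or output-file sizes for production.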

MidMon – State of the environment

• Useful data that it may be possible to monitor about the state of the environment:

  - Available execution nodes: this could be a list of changes in the available execution nodes in the environment

  - Load on an execution node: a measure of the load on an execution node. It could be a single value, a tuple, or a composite, e.g., the CPU load, the number of processes, and the free resources of the execution node

  - Load on a network link: a measure of the usage of a network link, in terms of the available bandwidth

  - Memory usage on an execution node: a measure of the memory usage on an execution node
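
One way to picture the composite load measure is as a small record per execution node that a scheduler could compare. The Python sketch below is an assumption-laden illustration: `NodeLoad`, its fields and the ordering used by `least_loaded` are hypothetical and not taken from MidMon.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodeLoad:
    """Composite load of one execution node (hypothetical fields)."""
    node: str
    cpu_load: float       # e.g. a load average
    processes: int        # number of running processes
    free_memory_mb: int   # free memory on the node


def least_loaded(nodes: Iterable[NodeLoad]) -> NodeLoad:
    """Pick the node with the lowest CPU load, then fewest processes, then most free memory."""
    return min(nodes, key=lambda n: (n.cpu_load, n.processes, -n.free_memory_mb))


candidates = [
    NodeLoad("node-a", cpu_load=2.5, processes=120, free_memory_mb=2048),
    NodeLoad("node-b", cpu_load=0.4, processes=80, free_memory_mb=1024),
    NodeLoad("node-c", cpu_load=0.4, processes=80, free_memory_mb=4096),
]
target = least_loaded(candidates)
```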

MidMon – Service availability

• Useful data that it may be possible to monitor about service availability:

  - Available services: this could be a list of the services available as mapping targets for tasks in a workflow. The data could also include, e.g., the status of services currently deployed

  - Available data resources: this could be a list of the data resources available as mapping targets for inputs and outputs in a workflow

OrthoSearch – SSH version

• Without control-flow modules

[Figure labels: hmmPFam, hmmSearch]

OrthoSearch on Kepler 1/3

[Figure labels: FormatDB, FastaCmd]

OrthoSearch on Kepler 2/3

[Figure labels: InterPro]

OrthoSearch on Kepler 3/3