common motifs in scientific workflows: an empirical...

14
Common motifs in scientific workflows: An empirical analysis Daniel Garijo , Pinar Alper , Khalid Belhajjame , Oscar Corcho , Yolanda Gil , Carole Goble ABSTRACT Workflow technology continues to play an important role as a means for specifying and enacting computational experiments in modern science. Reusing and re-purposing workflows allow scientists to do new experiments faster, since the workflows capture useful expertise from others. As workflow libraries grow, scientists face the challenge of finding workflows appropriate for their task, understanding what each workflow does, and reusing relevant portions of a given workflow. We believe that workflows would be easier to understand and reuse if high-level views (abstractions) of their activities were available in workflow libraries. As a first step towards obtaining these abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna, Wings, Galaxy and Vistrails. Our analysis has resulted in a set of scientific workflow motifs that outline (i) the kinds of data-intensive activities that are observed in workflows (Data-Operation motifs), and (ii) the different manners in which activities are implemented within workflows (Workflow-Oriented motifs). These motifs are helpful to identify the functionality of the steps in a given workflow, to develop best practices for workflow design, and to develop approaches for automated generation of workflow abstractions. 1. Introduction A scientific workflow is a template defining the set of tasks needed to carry out a computational experiment [1]. Scientific workflows have been increasingly used in the last decade as an in- strument for data intensive science. Workflows serve a dual func- tion: first, as detailed documentation of the scientific method used for an experiment (i.e. the input sources and processing steps taken for the derivation of a certain data item), and second, as re-usable, executable artifacts for data-intensive analysis. Scientific work- flows are composed of a variety of data manipulation activities such as data movement, data transformation, data analysis and data visualization to serve the goals of the scientific study. The composition is done through the constructs made available by the workflow system used, and is largely shaped by the function un- dertaken by the workflow and the environment in which the sys- tem operates. A variety of workflow systems, both open source (e.g. Taverna [2], Wings [3], Galaxy [4], Vistrails [5], Kepler [6], ASKALON [7]) and commercial (e.g. Pipeline Pilot 2 ) are in use in a variety of scientific disciplines such as genomics, astronomy, cheminformatics, etc. A workflow is a software artifact, and once developed and tested, it http://accelrys.com/products/pipeline-pilot/.

Upload: others

Post on 21-Sep-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

Common motifs in scientific workflows: An empirical analysis Daniel Garijo , Pinar Alper , Khalid Belhajjame , Oscar Corcho , Yolanda Gil , Carole Goble

A B S T R A C T

Workflow technology continues to play an important role as a means for specifying and enacting computational experiments in modern science. Reusing and re-purposing workflows allow scientists to do new experiments faster, since the workflows capture useful expertise from others. As workflow libraries grow, scientists face the challenge of finding workflows appropriate for their task, understanding what each workflow does, and reusing relevant portions of a given workflow. We believe that workflows would be easier to understand and reuse if high-level views (abstractions) of their activities were available in workflow libraries. As a first step towards obtaining these abstractions, we report in this paper on the results of a manual analysis performed over a set of real-world scientific workflows from Taverna, Wings, Galaxy and Vistrails. Our analysis has resulted in a set of scientific workflow motifs that outline (i) the kinds of data-intensive activities that are observed in workflows (Data-Operation motifs), and (ii) the different manners in which activities are implemented within workflows (Workflow-Oriented motifs). These motifs are helpful to identify the functionality of the steps in a given workflow, to develop best practices for workflow design, and to develop approaches for automated generation of workflow abstractions.

1. Introduction

A scientific workflow is a template defining the set of tasks needed to carry out a computational experiment [1]. Scientific workflows have been increasingly used in the last decade as an in­strument for data intensive science. Workflows serve a dual func­tion: first, as detailed documentation of the scientific method used for an experiment (i.e. the input sources and processing steps taken for the derivation of a certain data item), and second, as re-usable,

executable artifacts for data-intensive analysis. Scientific work­flows are composed of a variety of data manipulation activities such as data movement, data transformation, data analysis and data visualization to serve the goals of the scientific study. The composition is done through the constructs made available by the workflow system used, and is largely shaped by the function un­dertaken by the workflow and the environment in which the sys­tem operates.

A variety of workflow systems, both open source (e.g. Taverna [2], Wings [3], Galaxy [4], Vistrails [5], Kepler [6], ASKALON [7]) and commercial (e.g. Pipeline Pilot2) are in use in a variety of scientific disciplines such as genomics, astronomy, cheminformatics, etc. A workflow is a software artifact, and once developed and tested, it

http://accelrys.com/products/pipeline-pilot/.

Page 2: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

can be shared and exchanged between scientists. Other scientists can then reuse existing workflows in their experiments, e.g., as sub-workflows [8].

Workflow reuse presents several advantages [9]: allowing for principled attribution of established methods, improving qual­ity through incremental/evolutionary workflow development (by leveraging the expertise of previous users), and making scientific processes more efficient. Users can also re-purpose existing work­flows to adapt them to their needs [9]. Emerging workflow repos­itories such as myExperiment [10] and CrowdLabs [11] have made publishing and finding workflows easier, but scientists still face the challenges of understanding and reusing the available workflows.

A major difficulty in understanding workflows is their complex nature. A workflow may contain several scientifically-significant analysis steps, combined with other data preparation or result de­livery activities, and in different implementation styles depending on the environment and context in which the workflow is exe­cuted. This difficulty in understanding stands in the way of reusing workflows.

Through an analysis of the current practices in scientific work­flow development, we pursue the following objectives:

1. To reverse-engineer the set of current practices in workflow development through an empirical analysis.

2. To identify workflow abstractions that would facilitate under-standability and therefore effective reuse.

3. To detect potential information sources that can be used to in­form the development of tools for creating workflow abstrac­tions.

In this paper we present the result of an empirical analysis per­formed over 260 workflow descriptions from Taverna [2], Wings [3], Galaxy [4] and Vistrails [5]. Based on this analysis, we propose a catalog of domain independent conceptual abstractions for work­flow steps that we call scientific workflow motifs. Motifs are pro­vided through (i) a characterization of the kinds of data-operation activities that are carried out within workflows, which we refer to as Data-Operation motifs, and (ii) a characterization of the different manners in which those activity motifs are realized/implemented within workflows, which we refer to as Workflow-Oriented motifs.

This paper extends our previous work [12], which performed an analysis of 177 workflows from Wings and Taverna. The new contributions reported on in this paper are an extension of the re­lated work in Section 2, the addition and extension of scientific domains from Wings and Taverna workflows (Social Network Analysis, Astronomy and Domain Independent) in Sections 3 and 5; and the analysis of workflows from the Galaxy and Vistrails sys­tems among different domains (Genomics, Text Mining, Domain Independent and Medical Informatics). Finally, we have also revis­ited the motif catalog (Section 4), our previous results (Section 5) and conclusions (Section 7) according to our new findings.

2. Related work

Our motifs can be seen as higher-level patterns observed in scientific workflows. "Workflow patterns" have been extensively studied [13], where inventories of possible patterns are developed based on workflow constructs that are possible in different lan­guages, along with the ways to combine those constructs. Scientific workflows typically use a dataflow paradigm rather than a con­trol flow paradigm that is more typical of business workflows [ 14], and generic data-intensive usage patterns3 are described in [15]. Other classifications are based on the intrinsic properties of the

workflows (size, structure, branching factor, resource usage, etc.) [16,17] and their environmental characteristics (makespan, speed­up, success rate, etc.) [17]. Rather than specifying what is theo­retically possible with the given constructs, our work is instead based on an empirical analysis to detect similar data-intensive activities that recur in workflows across different domains and workflow systems. In addition, our work offers a complementary perspective in that we aim to understand groupings of workflow steps that form a meaningful high-level data manipulation opera­tion.

In Software Engineering, the term "pattern" refers to estab­lished best practices to solve recurring problems. In this regard pat­terns represent good and exemplary practice. In [18] the authors outline anti-patterns in scientific workflows, namely redundancy and structural conflicts. The authors go on to provide a solution to address the redundancy anti-pattern. Particularly due to this per­ception of the term "pattern", in this paper we opted to use the term "motif for our classification of tasks. Our objective is to take a snapshot of the existing set of activities in workflows, rather than try to prescribe a best practice.

Our Data-Operation motifs can be seen as a domain-independent classification of tasks within scientific workflows. Similar analy­ses have been done in a domain-specific manner in areas such as bioinformatics, based on user studies [19]. Combined with such-domain specific classifications, motifs can make way for specifica­tion of abstract workflow templates, which can be elaborated to concrete workflows prior to their execution [20].

Another work, somewhat closer to our study in spirit, is an au­tomated analysis of workflow scripts from the Life Science do­main [21]. This work aims to deduce the frequency of different kinds of technical ways of realizing workflow steps (e.g. service invocations, local "scientist-developed" scripting, local "ready-made" scripts, etc.). [21] also drills down into the category of local ready-made scripts, to outline a functional breakdown of their ac­tivity categories such as data access or data transformation. While this provides an insight into the kind of activities undertaken in workflows, it focuses on characterizing local task types. Our ap­proach is different from this work as we focus on detecting multi-step activities with many realizations (not just individual steps).

[22] extends the categories defined in [21] identifying sub­categories at a processor level by analyzing 898 workflows in my­Experiment. The main difference with our analysis is that some of the proposed categories are based on technological features of the processors (i.e., the type of script) for highlighting workflow reuse among the dataset, while our catalog relies on their func­tional characteristics.

Finally, Problem Solving Methods (PSMs) is another area of re­lated work. PSMs describe the reasoning process to achieve the goal of a task in an implementation and domain-independent man­ner [23]. Some libraries aim to model the common processes in scientific domains [24], which could be further refined with the motifs proposed in this work.

3. Analysis setup

For the purposes of the analysis, we used workflows from Taverna [2], Wings [3], Galaxy [4] and Vistrails [5]. These systems have different features:

• Taverna [2] can operate in different execution environments and provides several possibilities of deployment. Taverna is available as a workbench,4 which embodies a desktop design

http://www.workflowpatterns.com/patterns/data/. Taverna Workbench http://www.taverna.org.uk/download/workbench/.

Page 3: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

Table 1 Summary of the main differences in the features of each workflow system: explicit support for control constructs (conditional, loops), whether the user interface is web-based or not, whether the environment is open or controlled and the engine used.

Workflow system Control constructs GUI Environment type Engine

Taverna Wings Galaxy Vistrails

NO NO NO YES

Desktop/Web Web Web Desktop/Web

Open/Controlled Controlled Open/Controlled Open/Controlled

Taverna Pegasus/ Apache OODT Galaxy Vistrails

UI and an execution engine. Taverna also allows standalone de­ployments of its engine5 in order to serve multiple clients. In its default configuration Taverna does not prescribe that the datasets and tools are integrated into an execution environ­ment. In this sense it adopts an open-world approach, where workflows integrate (typically) remote third party resources and compose them into data-intensive pipelines. On the other hand, it also allows the development of plug-ins for the access and usage of dedicated computing infrastructures (e.g. grids) or local tools and executables. Its use has been extended from Bioinformatics to several domains including Astronomy, Chem­istry, Text Mining and Image Analysis.

• Wings [3] uses semantic representations to describe the con­straints of the data and computational steps in the workflow. Wings can reason about these constraints, propagating them through the workflow structure and use them to validate work­flows. It has been used in different domains, ranging from Life Sciences to Text Analytics and Geosciences. Wings provides a web based access and can run workflows locally, or submit6

them to the Pegasus/Condor [25] or Apache OODT [26] execu­tion environments that can handle large-scale distributed data and computations.

• Galaxy [27] is a web-based platform for data intensive biomedi­cal research which has many followers in the scientific commu­nity [28]. One of the main features of Galaxy is its cloud backend, which provides support for its extensive catalog of tools. These tools allow performing different types of analysis of data from widely used existing datasets in the biomedical domain. Galaxy uses its own engine for managing the workflow execution, com­patible with batch systems or Sun Grid Engine (SGE).7 Galaxy workflows can be run online8 or by setting up a local instance.9

• Vistrails [29] tracks the change-based provenance in workflow specifications in order to facilitate reuse. It has been used in different domains of Life Sciences like Medical Informatics and Biomedicine, but also in other domains like Image Processing, Climatology and Physics. Vistrails uses its own engine to man­age the execution, which allows for the combination of special­ized libraries, grid and web services. Its workflows can be run online10 or locally.11

The choice of these systems was due to the availability of a pool of shared workflows through repositories [11] [10] [30] and portals12,13 but also because of their similarities:

• They provide similar workflow modeling constructs. Unlike other workflow systems (e.g., business workflows), the selected

-1 Taverna Server http://www.taverna.org.uk/download/server/.

http://www.wings-workflows.org.

http://star.mit.edU/cluster/docs/0.93.3/guides/sge.html. Q

° https: //main.g2.bx.psu.edu/root.

http://wiki.galaxyproject.org/Admin/Get%20Galaxy. 1 U http://www.crowdlabs.org/vistrails/.

http://www.vistrails.org/index.php/Downloads. 12

workflow systems are often data-flow oriented, and they op­erate on large and heterogeneous data that may need to be archived to be used in further experiments.

• All of them are open-source scientific workflow systems, initially focused on performing in-silico experimentation on the Life Sciences (Taverna, Galaxy, Vistrails) and Geosciences (Wings) domains. Taverna, Wings and Vistrails now also have workflows across other different domains like Astronomy, Ma­chine Learning, Meteorology, etc.

• All the systems can interact with third party tools and services, and they include a catalog of components for performing differ­ent operations with data.

It is worth mentioning that despite being similar, we can find some differences among the selected systems. In particular, Vistrails has explicit mechanisms for the basic control constructs (e.g., conditionals and looping) in one of its latest releases, while Taverna, Wings and Galaxy are observers of the pure data-flow paradigm (although there are implicit ways of implementing such control structures).

The variety of environments in which these systems operate highlights some other differences as well. While Taverna allows users to specify workflows that make use of autonomous third party services (i.e. an open environment), Wings requires that the resources and the analysis operations are made part of its envi­ronment prior to being used in experiments (i.e. controlled envi­ronment). However, the difference between the systems is not a significant differentiating factor, as Taverna allows more control to be added to the environment through the addition of plug-ins, and Wings can establish a connection to third party services via custom components. Vistrails and Galaxy could be positioned at an inter­mediate point, since they provide access to external web-services but also build on a comprehensive library of components.

A summary of the commonalities and differences among the workflow systems included in the analysis can be seen in Table 1.

3.1. Description of the sets of workflows analyzed

For our analysis, we have chosen 260 heterogeneous workflows in a variety of domains. We analyzed a set of public Wings work­flows (89 out of 132 workflows), part of the Taverna set (125 out of 874 Taverna2 workflows in myExperiment), a set of Galaxy work­flows (26 out 145 of workflows) and part of the Vistrails set (20 out 274 of workflows).

• For Wings, we have analyzed all workflows from Drug Discov­ery, Text Mining, Domain Independent, Genomics and Social Network Analysis domains.

• For Taverna we have analyzed workflows that were available in myExperiment [10]. We determined the groups/domains of workflows by browsing the myExperiment group tags14 and identifying those domains which contained workflows that were publicly accessible at the time of the analysis. For the Tav­erna dataset we analyzed Cheminformatics, Genomics, Astron­omy, Biodiversity, Geo-Informatics and Text Mining domains.

13 https: //main.g2.bx.psu.edu/.

http://www.opmw.org/sparql. 1 4 http://www.myexperiment.org/groups.

Page 4: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

Table 2 Number of workflows analyzed from Taverna (T), Wings (W), Galaxy (G), Vistrails (V).

Table 3 Maximum, minimum and average number of steps within workflows per domain.

Domain No. of workflows Source

W

Drug discovery Astronomy Biodiversity Cheminformatics Genomics Geo-informatics Text analysis Social network analysis Medical informatics Domain independent

7 51 12

7 90

6 45

5 7

30

0 51 12 7

38 6

11 0 0 0

7 0 0 0

28 0

31 5 0

18

0 0 0 0

23 0 3 0 0 0

0 0 0 0 1 0 0 0 7

12

Total 260 125 89 26 20

The distribution of workflows to domains is not even, as it is also the case in myExperiment. Taverna is the workflow sys­tem with the largest public collection of workflows, in order to obtain a feasible subset of workflows for manual analysis, we made random selections from each identified domain.

• For Galaxy we have chosen the documented workflows avail­able in the public workflow repository15 (i.e., those workflows with annotations explaining the functionality of their compo­nents). Since Galaxy is specialized in the biomedical domain, most of the workflows are from the Genomics domain, although some of them (3) do text analysis operations in files.

• For Vistrails we have chosen a set of documented workflows available in Crowdlabs and tutorials, which include domains in Medical Informatics and Genomics. It is worth mentioning that some workflows are domain independent (machine learning workflows, visualization and rendering of datasets, annotation of texts, etc.), so they have been included under a new category.

When selecting the workflows for the analysis we paid attention to including workflows that are developed with the intention of backing actual data-intensive scientific investigations. We refrained from including toy or example workflows, which are used for demonstrating the capabilities of different workflow systems. Table 3 provides additional information on the size of workflows analyzed in terms of the range and average number of analysis tasks. The number of workflows analyzed from each domain can be seen in Table 2.

3.2. Approach for workflow analysis

Our analysis has been performed based on the documentation, metadata and traces available for each of the workflows within the cohort studied. Each workflow has been inspected using the associated workbench/design environment. Documentation on workflows is provided within workflow descriptions and in repos­itories in which the workflows are published. We have performed a bottom-up and manual analysis that aimed to surface two or­thogonal dimensions regarding activities/operations that make-up workflows: (1) outline what kind of data-operation activity is be­ing undertaken by a workflow step and (2) how that activity has been realized. For example, a visualization step (data oriented ac­tivity) can be realized in different ways: via a stateful multi-step invocation, through a single stateless invocation (depending on the environmental constraints and nature of the services), or via a sub-workflow.

Domain Max. size Min. size Avg steps

5 6

14 6

Drug discovery 18 1 Astronomy 33 1 Biodiversity 12 1 Cheminformatics 20 1 Genomics 53 1 Geo-informatics 14 3 Text analysis 15 1 Social network analysis 7 3 Medical informatics 29 8 Domain independent 20 1

Table 4 Scientific workflow motifs.

Data-Operation motifs Data preparation

Combine Filter Format transformation Input augmentation Output extraction Group Sort Split

Data analysis Data cleaning Data movement Data retrieval Data visualization

Workflow-Oriented motifs Inter workflow motifs

Atomic workflows Composite workflows Workflow overloading

Intra workflow motifs Internal macros Human interactions Stateful (asynchronous) invocations

The only automated part of the data collection was associated to the workflow size statistics for Taverna workflows. The myExperi­ment repository provides a REST API16 that allows retrieving infor­mation on Taverna workflows. Using this facility we were able to automate the collection of partial statistics regarding the number of workflow steps and the number of input/output parameters of Taverna workflows.

Rather than hypothesizing possible motifs up front, we built up a set of motifs as we progressed with the analysis. For each workflow we recorded the number of occurrences of each motif (independently of the workflow system it belonged to). In order to minimize misinterpretation and human error, the motif occur­rences identified for each workflow have been cross-validated by another author and discussed until agreement.

4. Scientific workflow motif catalog for abstracting workflows

This section introduces details on the scientific workflow motifs detected in our analysis. Motifs are divided into two categories: Section 4.1 introduces the Data-Operation motifs (i.e., the motifs related to the main functionality undertaken by the steps of the workflow), while Section 4.2 explains the Workflow-Oriented motifs (i.e., how the Data-Operation motifs are undertaken in the workflow). An overview is provided in Table 4.

https: //main.g2.bx.psu.edu/workflow/list_published. http://wiki.myexperiment.0rg/index.php/Devel0per:API.

Page 5: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

Merger I |SMAPAIignementResuItMergeft^-"-\vj' —.1 J____ ____ __ . 1; Merge

Fig. 1. Sample motifs in a Wings workflow fragment for drug discovery. A comparison analysis is performed on two different input datasets (SMAPV2). The results are then sorted (SMAPResultSorter) and finally merged (Merger, SMAPAlignementResultMerger).

4.1. Data-Operation motifs

4.1.1. Data preparation Data, once it is retrieved, may need several transformations

before being able to be used in a workflow step. The most common activities that we have detected in our analysis are:

4.1.1.1. Combine. Data merging or joining steps are commonly found across workflows. The Combine motif refers to the step or group of steps in the workflow aggregating information from dif­ferent inputs. An example can be seen in Fig. 1, where the results of both branches of a workflow fragment used for drug discovery are merged for presenting a single output result.

4.1.1.2. Filter. The datasets brought into a pipeline may not be subject to analysis in their entirety. Data could further be filtered, sampled or could be subject to extraction of various subsets.

4.1.1.3. Format transformation. Heterogeneity of formats in data representation is a known issue in many scientific disciplines. Workflows that bring together multiple access or analysis activities usually contain steps for format transformations, sometimes called "Shims" [31], that typically preserve the contents of data while converting its representation format.

4.1.1.4. Input augmentation. Data access and analysis steps that are handled by external services or tools typically require well formed query strings or structured requests as input parameters. Certain tasks in workflows are dedicated to the generation of these queries through an aggregation of multiple parameters. An exam­ple of this is provided in the workflow of Fig. 2: For each service invocation (e.g. getJobState) there are steps (e.g. getJobStateJnput) that are responsible for creating the correctly formatted inputs for the service.

4.1.1.5. Output extraction. Outputs of data access or analysis steps could be subject to data extraction to allow the conversion of data from the service format to the workflow internal data carrying structures. This motif is associated with steps that perform the

inverse operation of Input Augmentation. An example is given in Fig. 2, where output splitting steps (e.g. getJobState_output) are responsible for parsing the result XML message returned from the service (getJobState) to return a singleton value containing solely the job state.

4.1.1.6. Group. Some steps of the workflow reorganize the input into different groups or aggregate the inputs on a given collection of data items. For example, grouping a table by a certain category or executing a SQL GROUP-BY clause on an input dataset.

4.1.1.7. Sort. In certain cases datasets containing multiple data items/records are to be sorted (with respect to a parameter) prior to further processing. The Sort motif refers to those steps. Fig. 1 shows an example where the inputs resulting from the data anal­ysis {comparisonResults) are sorted {ComparisonResultsV2 compo­nent) before being merged in a subsequent step.

4.1.1.8. Split. Our analysis has shown that many steps in the work­flows separate an input (or group of inputs) into different outputs. For example, the splitting of a dataset in different subsets to be pro­cessed in parallel in a workflow.

4.1.2. Data analysis This motif refers to a rather broad category of tasks in diverse

domains, and it is highly relevant because it often represents the main functionality of the workflow. An important number of work­flows are designed with the purpose of analyzing or evaluating different features of input data, ranging from simple comparisons between the datasets to complex protein analysis to see whether two molecules can be docked successfully or not. An example is given in the workflow of Fig. 2 with a processing step named warp2D, and the steps named SMAPV2 in Fig. 1 with a ligand bind­ing sites comparison of the inputs.

4.1.3. Data cleaning/curation We have observed the steps for cleaning and curating data as

a separate category from data preparation and filtering. Typically these steps are undertaken by sophisticated tooling/services, or by

Page 6: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

s!ack..value ) Workflow input pons

NTimePoints value

jtputWarpedFi!eName_value/

sampPeaklistFile relrencePeakhstFile

max PeaksPerSegment_value

DataUpload J

getJobState_output DownloadResults

^ v >

Input Augmentation

Data { Analysis

Output Extraction

[ Worifl< iw output pons

ResultFile Message ^

Fig. 2. Sample motifs in a Taverna workflow for functional genomics. The workflow transfers data files containing proteomics data to a remote server and augments several parameters for the invocation request. Then the workflow waits for job completion and inquires about the state of the submitted warping job. Once the inquiry call is returned the results are downloaded from the remote server.

specialized users. A cleaning/curation step essentially preserves and enriches the content of data (e.g., by a user's annotation of a result with additional information, detecting and removing inconsistencies on the data, etc.).

4A.4. Data movement Certain analysis activities that are performed via external tools

or services require the submission of data to a location accessible by the service/tool (i.e., a web or a local directory respectively). In such cases the workflow contains dedicated step(s) for the up­load/transfer of data to these locations. The same applies to the outputs, in which case a data download step is used to chain the data to the next steps of the workflow. The data deposition of the workflow results to a specific server would also be included in this category. In Fig. 2, the DataUpload and DownloadResults steps ship data to the server on which the analysis will be done, and also re­trieve back the results via a dedicated download step.

4A.5. Data retrieval Workflows exploit heterogeneous data sources, remote data­

bases, repositories or other web resources exposed via SOAP or REST services. Scientific data deposited in these repositories are retrieved through query and retrieval steps inside workflows. Certain tasks within the workflow are responsible for retrieving data from such external sources into the workflow environment.

We also observed that certain data integration workflows contain multiple linked retrieval steps, being essentially parameterized data integration chains.

4A.6. Data visualization Being able to show the results is as important as producing

them in some workflows. Scientists use visualizations to show the conclusions of their experiments and to take important decisions in the pipeline itself. Therefore, certain steps in workflows are dedicated to generation of plots, graphs, tables, XMLs or Microsoft Excel files outputs from input data. This category is also known as the result delivery of the experimental results.

4.2. Workflow-Oriented motifs

We divide this category in two different groups, depending on whether motifs are observed among workflows, by analyzing the relations of the workflow with other workflows (Inter Workflow Motifs); or within workflows, by exploring the workflow itself (Intra Workflow Motifs).

4.2. \. Inter workflow motifs

4.2AA. Atomic workflows. Our review has shown that a significant number of workflows perform an atomic unit of functionality,

Page 7: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

00%

90% -

80% -

70% -

60% -

50%

40% -

30% -

20%

10% -

0% -

J? ..-5S' J,*1 & .if A- ,«S / / ^ y./ / /

I Data Preparation

l Data Analysis

I Data Cleaning

l Data Movement

I Data Retrieval

I Data Visualization

Fig. 3. Distribution of Data-Operation motifs per domain.

which effectively requires no sub-workflow usage. Typically these workflows are designed to be included in other workflows. Atomic workflows are the main mechanism of modularizing functionality within scientific workflows.

4.2.1.2. Composite workflows. The usage of sub-workflows ap­pears as a motif for exploiting modular functionality from multi­ple workflows. This motif refers to all those workflows that have one or more sub-workflows included in them (in some cases, sub-workflows offer different views of the global workflow, as they could have overlapping steps).

4.2.1.3. Workflow overloading. Our analysis has shown that au­thors tend to deliver multiple workflows with the same functional­ity, but operating over different input parameter types. An example is performing an analysis over a String input parameter versus per­forming it over the contents of a specified file, generalizing a work­flow to work with collections of files instead of single inputs, etc. Overloading is a response to the heterogeneity of environments, directly related to workflow reuse (as most of the functionality of the steps in the overloaded workflow remains the same).

4.2.2. Intra workflow motifs

4.2.2.1. Internal macros. This category refers to those groups of steps in the workflow that correspond to repetitive patterns of combining tasks. An example can be seen in Fig. 1, where there are two branches of the workflow performing the same operations in the same order (SMAPV2 plus SMAPResultSorterV2 steps).

4.2.2.2. Human interactions. We have observed that some scien­tific workflows systems increasingly make use of human interac­tions to undertake certain activities within workflows. These steps are often necessary to achieve some functionality of the workflow that cannot be (fully) automated, and requires human computa­tion to complete. Typical examples of such activities are manual data curation and cleaning steps (e.g., annotating a Microsoft Excel file), manual filtering activities (e.g., selecting a specific data subset to continue the experiment), etc.

4.2.2.3. Stateful/asynchronous invocations. Certain activities such as analysis or visualizations could be performed through interac­tion with stateful (web) services that allow for creation of jobs over remote grid environments. Such activities require invocation of multiple operations at a service endpoint using a shared state

identifier (e.g. Job ID). An example is given in the workflow of Fig. 2, where the service invocation warp2D causes the creation of a state­ful warping job. The call then returns a JobID, which is used to in­quire about the job status (getJobStatus), and to retrieve the results (DownloadResults).

5. Workflow analysis results

In this section, we report on the frequencies of the Data-Operation and Workflow-Oriented motifs within the workflows selected for our analysis, and discuss how they are correlated with the workflow domains.17 A detailed explanation of the frequencies for each domain and for each workflow system can be seen in the Appendix.

Fig. 3 illustrates the distribution of Data-Operation motifs across the domains. The analysis of this figure shows the predom­inance of the data preparation motif, which constitutes more than 50% of the Data-Operation motifs in the majority of domains. This is an interesting result as it implies that data preparation steps are more common than any other activity, specifically those that perform the main (scientifically-significant) functionality of the workflow. The abundance of these steps is one major obstacle for understandability, since they burden the documentation function and create difficulties for the workflow reuser scientists. The Social Network Analysis domain is an exception, as it consults, analyzes and visualizes queries and statistics over concrete data sources without performing any data preparation steps. Fig. 3 also demon­strates that within domains such as Genomics, Astronomy, Med­ical Informatics or Biodiversity, where curated common scientific databases exist, workflows are used as data retrieval clients against these databases.

Drilling down to Data Preparation, Fig. 4 shows the dominance of Filter, Input Augmentation and Output Extraction motifs for most domains. Input Augmentation and Output Extraction are activities which can be seen as adapters that help plugging data analysis capabilities into workflows. Their number is higher in workflows relying on third party services, i.e., most Taverna do­mains (Biodiversity, Cheminformatics, Geo-Informatics); while Fil­tering is higher in Wings, Galaxy and Vistrails workflows. Fig. 4 also demonstrates how the existence of a widely used common data

1 ' Results available at http://www.myexperiment.org/packs/364.html.

Page 8: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

S .-f & ip

<r J' <P c?

" / -

^ S? &

&

? #-

• Combine

• Filter

• Format Transformation

Group

• Input Augmentation

n Output Extraction

Sort

• Split

Fig. 4. Distribution of data preparation motifs per domain. The Social Network Analysis domain is not included, as it does not have any data preparation motifs.

100%

90%

80%

70%

60%

50%

40%

30%

20%

10% -

• Combine

• Filter

• Format Transformation

Group

• Input Augmentation

J Output Extraction

Sort

• Split

WINGS-LIFE TAVERNA-L1FE GALAXY-LIFE VISTRAILS-LIFE

SCIENCES SCIENCES SCIENCES SCIENCES

Fig. 5. Data preparation motifs in the Life Sciences workflows.

structure for a domain, in this case the VOTable in Astronomy.18 re­duces the need for domain-specific data transformations in work­flows.

Some of the differences between the systems are reflected in the motifs results. As displayed in the comparative Fig. 5 for the Life Sciences domain (a general domain shown in Table 5 includ­ing the Genomics, Drug discovery, Biodiversity, Chemical Infor­matics and Medical Informatics domains), in Wings, Galaxy and Vistrails input augmentation and output extraction steps are much less required (around 30%, 20% and 20% respectively versus almost 50% in Taverna) as the inputs are either controlled (Galaxy, Vis-trails) or strongly typed (Wings) and the data analysis steps are pre-designed to operate over specific types of data. Within Fig. 6 we observe that Wings workflows do not contain any data retrieval or movement steps, as data is pre-integrated into the workflow environment (data shipping activities are carried out behind the scenes by the Wings execution environment) whereas in Taverna the workflow carries dedicated steps for querying databases and shipping data to necessary locations for analysis. Galaxy and Vis-trails also include components to retrieve content from external

Table 5 Distribution of workflows from Taverna (T), Wings (W), Galaxy (G) and Vistrails (V) in the Life Sciences domain.

Domain No. of workflows Source

Drug discovery Biodiversity Cheminformatics Genomics Medical informatics

7 12

7 90

7

W T T T(38),W(28),G(23),V(1) V

Total 123

1R 1 0 http://www.ivoa.net/Documents/VOTable/.

datasets into the environment (2% and 10% respectively), although we did not find steps for moving the data of intermediate steps to external services among the set of workflows analyzed. In the case of Galaxy this happens because most data retrieval and mov­ing steps are performed transparently to the workflow execution (individual components are used to retrieve the data to the user do­main, and that data is then used as input of the workflow); while in Vistrails the main analysis steps of the analyzed workflow set were performed using custom components.

Another interesting finding is the amount of visualization steps found in the Life Science domain (Fig. 6). One feature of Vis­trails and Galaxy is the tools included for the visualization of

Page 9: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

Data Preparation

Data Analysis

Data Cleaning

Data Movement

Data Retrieval

Data Visualization

GALAXY-LIFE SCIENCES

VISTRAILS- LIFE SCIENCES

Fig. 6. Data-Operation motifs in the Life Sciences workflows.

I Atomic workflows

l Composite workflows

I Internal macro

I Human Interactions

iStateful Invocations

l Workflow Overload

GALAXY-LIFE SCIENCES

VISTRAILS-LIFE SCIENCES

Fig. 7. Workflow-Oriented motifs in the Life Sciences workflows.

workflow results. In Vistrails workflows almost 40% of the motifs found are visualization steps, but this percentage is very reduced in Galaxy (less than 5%). This is due to a separate visualization tool in Galaxy19 which reduces the need for visualization steps in the workflows. As shown in Fig. 6, the visualization steps in Taverna and Wings are considerably smaller (around 2% and 10% respec­tively).

The impact of the difference in the execution environments of the analyzed workflow systems is also observed on the Workflow-Oriented motifs, as can be seen in Fig. 7. Stateful invocations mo­tifs are not present in Wings, Galaxy and Vistrails workflows, as all steps are handled by a dedicated workflow scheduling frame­work/pipeline system and the details are hidden from the work­flow developers. In Taverna's default configuration, there are no execution environments or scheduling frameworks prescribed to the users. Therefore the workflow developers are (1) either respon­sible for catering for various different invocation requirements of external resources, which may include stateful invocations requir­ing execution of multiple consecutive steps in order to undertake a

1 y https: //main.g2.bx.psu.edu/visualization/trackster.

single function (2) they can develop specific plug-ins that wrap-up stateful interactions and boiler plate steps.

Regarding Workflow-Oriented motifs, Fig. 8 shows that human interaction steps are increasingly used in scientific workflows, especially in the Biodiversity and Cheminformatics domains. Human interaction steps in Taverna workflows are handled either through external tools (e.g., Google Refine), facilitated via a human-interaction plug-in, or through local scripts (e.g., selection of configuration values from multi-choice lists). However, we observed that several boiler-plate set-up steps are required for the deployment and configuration of external tooling to support the human tasks. These boiler plate steps result in very large and complex workflows. Wings and Vistrails workflows do not contain human interaction steps. Galaxy is an environment that is heavily based on user-driven configuration and invocation of analysis tools (some parameters and inputs of the workflows can even be changed after the execution of the workflow has started). However, based on our definition of Human Interactions, i.e. analytical data processing undertaken by a human, the Galaxy workflows that we have analyzed do not contain any human computation steps either.

Fig. 8 also shows a large proportion of the combination of Composite Workflows and Atomic Workflows motifs (up to more than 60%) which confirm that the use of sub-workflows is an

Page 10: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

'f//ss ̂ i?

er ^

I Atomic workflows

I Composite workflows

I Internal macro

I Human Interactions

I Statefull Invocations

I Workflow Overload

^

Fig. 8. Distribution of Workflow-Oriented motifs in the analyzed workflows per domain.

wfm:has Motif

jsubPropertyOf

wfm:hasWorkflowMotif|

sub Property Of

WorkflowStep^ywfm:hasDataOperationMoti

wfm: Motif

subclassOf

Fig. 9. Diagram showing an overview of the structure of the Motif Ontology. The wfm prefix is used for describing the different classes and properties of the Motif Ontology, while the ex prefix is used as a placeholder for any vocabulary for describing workflows and their steps.

established best practice for modularizing functionality and reuse. Sub-workflows can also be used to encapsulate the prepara­tion steps and multi-step service invocations within "Compo­nents" [32 ], in order to reduce obfuscation. These components have well-defined interfaces, support for complex data typing, interface metadata and built-in error-handling.20 The Workflow Overload motif also plays a relevant role, appearing in almost 10% of the analyzed workflows. Workflows containing this motif are consid­ered advanced forms of sub-workflow development. By executing workflows in different settings, the authors provide overloaded versions of the same functionality in different workflows to in­crease the coverage on target uses. While we observe overloading as a good practice, a significant behavior of workflow developers in the Taverna environment is to extend their workflows with the ability to accept input from multiple ports in different formats. We believe that such overloading behavior within a single workflow is a poor practice and should be avoided. Instead, multiple workflows operating a single designated input format should be provided.

6. Motif Ontology and annotation of scientific workflows

Our ultimate objective is to provide a catalog of motifs. We ex­pect that this catalog will be used to annotate workflows to denote the nature of activities occurring in them. These annotations would

allow (1) helping the creators to describe the particular function­ality of the workflows to reach a broader audience of possible reusers, (2) helping in the creation of new workflows by assist­ing users (e.g., suggesting components based in our motif catalog); and (3) helping in the search of workflows with certain functional­ity (e.g., workflows with data retrieval, analysis and filtering). This would also be beneficial from a workflow designer perspective, so as to obtain workflows that are similar to the ones being designed.

6. \. Representing motifs

In order to provide workflow designers and curators with the means to annotate, we have designed an OWL ontology21 that cap­tures the motifs detected in our empirical analysis. Fig. 9 illustrates the basic structure of the ontology. The right hand side of the fig­ure shows how the motifs are organized: the class wfm:Motif represents the different classes of motifs identified in Section 4 (wfm stands for the prefix the Motif Ontology, with namespace URI http://purl.org/net/wf-motifs). This class is categorized into two specialized sub-classes wfm:DataOperationMotif and wfm:Workf lowMotif, which are sub-classed according to the taxonomy represented in Table 4.

The ontology provides the wfm:hasMotif property in or­der to associate workflows and their operations with their mo­tifs. The properties wfm:hasDataOperationMotif and wfm:

http://heater.cs.man.ac.uk:4040/web-wf-design-neiss/index.html. http://purl.org/net/wf-motifs.

Page 11: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

wfm:hasData Operation Mot if wfm:lnputAugmentation

: m t f l

Fig. 10. Subset of the annotations of the Taverna workflow shown in Fig. 2 using the Wfdesc model.

hasWorkf lowMotif allow annotating workflows and their steps with more specificity. These properties have no domain specified, as different workflow models may use different vocabularies for describing workflows and their parts. Therefore, in Fig. 9 work­flows, steps and processes are used as a place holder using the ex prefix.

6.2. Representing workflows and workflow steps

Workflows may be represented with different models and vocabularies like Wfdesc [33], OPMW [30], P-Plan [34] or D-PROV [35]. While providing an abstract and consistent represen­tation of the workflow is not a pre-requisite to the usage of the Motif Ontology, we consider it a best-practice to use a model that is independent from any specific workflow language or technol­ogy. An example of annotation using the Wfdesc model is given in Fig. 10 by exposing the annotations of part of the Taverna workflow shown Fig. 2.

The annotations encoded using the Motif Ontology could be used for several purposes. By providing explicit semantics on the data processing characteristics and the implementation charac­teristics of the operations, annotations improve understandability and interpretation. Moreover, they can be used to facilitate work­flow discovery. For example, the user can issue a query to identify workflows that implement a specific flow of data manipulation and transformation (e.g., return the workflows in which data re­formatting is followed by data filtering and then data visualization). Furthermore, having information on characteristics of workflow operations allows for manipulation of workflows to generate sum­maries. As described in [36], motif markup allows users to spec­ify workflow reduction rules based on motifs (e.g. eliminate data preparation, organization steps, group all steps that belong to the same asynchronous invocation, etc.).

7. Conclusions

The difficulty in understanding the function of workflows is an impediment to reusing and re-purposing scientific workflows. To address this problem, motifs that provide high level descriptions of the tasks carried out by workflow steps can be effective. As a step towards this goal, we reported in this paper on an empirical analy­sis22 that we conducted using Taverna, Wings, Galaxy and Vistrails

^ Contents available at http://www.oeg-upm.net/files/dgarijo/motifAnalysisSite/.

workflows with the objective of identifying the motifs embedded within those workflows. In doing so, we have defined a catalog of motifs distinguishing Data-Operation motifs, which describe the tasks carried out by the workflow steps, from Workflow-Oriented motifs, which describe the way those tasks are implemented within the workflow. It is worth mentioning that, although important, motifs that have to do with scheduling and distributing the exe­cution of workflows [37] or the control flow of the workflow [13] have been left out of the scope of this paper.

We identified 6 major types of Data-Operation motifs and 6 types of Workflow-Oriented motifs that are used across different domains. We created a Motif Ontology based on the motif catalog that provides users with the necessary vocabulary to annotate workflows with high-level motifs to facilitate understanding. Part of our current work is the annotation of the motifs in workflows belonging to a provenance repository [38], in order to improve the existent workflow descriptions.

The frequency by which the motifs appear depends on the differences among the workflow environments and differences in domains. Regarding data preparation motifs (the most common type of motifs), we found that their use is correlated with the type of execution environment for which the workflow is designed. In particular, in a workflow system such as Taverna, which by default does not require data and tools to be embedded into the execution environment, many steps in the workflow can be dedicated to the moving and retrieval of heterogeneous datasets, and stateful resource access protocols. On the other hand, in a workflow system such as Wings, Vistrails and Galaxy we notice that some data preparation motifs, such as data moving, are minimal and in certain domains absent. This happens either because data is pre-integrated into the workflow execution environment or because data primarily exists in external environments and the workflow execution engine performs these operations transparently to the user. The differences among the systems highlight that specialized resource or data access components/plug-ins and standardization in data formats would contribute significantly to the simplification of scientific workflows.

The workflows used in our analysis are taken from a variety of heterogeneous domains and have been crafted by a group of different domain experts scientists. The distribution of the cohort studied among the domains is not even, as their number, availability and documentation differs for each workflow system. Our catalog of motifs refers to the workflows included in the analysis (with special relevance of the Life Science domain), but our intuition is that most of the motifs will be found in other domains and in other workflow systems. Future work expanding the analysis

Page 12: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

Table A. 1 Distribution of the data preparation motifs among the workflows analyzed, grouped by domain.

Combine Filter Format transformation Group Input augmentation Output extraction Sort Split Data preparation

Genomics TextMining Drug discovery Astronomy Biodiversity Cheminformatics Geo-informatics Social network analysis Domain independent Medical informatics

93 12 7

77 5 3 0 0 2 0

62 64

3 29

0 1 1 0

27 21

56 33

4 21

7 4 7 0

19 18

4 0 0 4 0 0 0 0 1 0

87 7 0

32 3 6

12 0 0 2

90 9 4

39 2 6

13 0

15 6

6 15 6 0 0 0 0 0 0 0

24 5 0

13 7 1 1 0 0 0

422 145

24 215

24 21 34

0 64 47

Table A.2 Distribution of the Data-Operation motifs among the workflows analyzed, grouped by domain.

Genomics TextMining Drug discovery Astronomy Biodiversity Cheminformatics Geo-informatics Social network analysis Domain independent Medical informatics

Data preparation

422 145

24 215

24 21 34

0 64 47

Data analysis

134 46 12 48

4 1 8 5

29 1

Data cleaning

13 9 6

20 4

12 0 0 0 0

Data movement

32 5 0

30 7

18 0 0 0 0

Data retrieval

63 2 0

48 7 3 3 5 8 7

Data visualization

26 3 3 2 0 3 1

18 51 33

Total data operation

690 210 45

362 46 58 46 28

152 88

Table A3 Distribution of the Workflow-Oriented motifs among the workflows analyzed, grouped by domain.

Atomic workflow Composite workflow Internal macro Human interaction

Stateful invocation

Workflow overload

Total workflow motifs

Genomics Text Mining Drug discovery Astronomy Biodiversity Cheminformatics Geo-informatics Social network analysis Domain independent Medical informatics

62 25 4

33 7 3 4 3

14 5

26 31

3 57 19 0 2 2

16 2

37 19 6 2 0 0 0 0 4 5

4 0 0 7

10 1 0 0 0 0

6 0 0 4 2 1 0 0 0 0

4 14 2 0 2 0 0 2

11 0

139 89 15

103 40

5 6 7

45 12

on other systems like Kepler, Pegasus or ASKALON will be required to further validate our findings.

As part of our future work, we envisage deriving best practices that can be used in workflow design and providing tools that assist users in automatic workflow annotation using our Motif Ontology. In particular, in an environment like Wings, where semantic typ­ing is supported, it could be possible to automatically detect some Data Preparation activities by inferencing over the types of inputs and the outputs and the task types. In an open environment like Taverna, such classifications are not available, but there are other resources for inferring functionality of steps, like controlled tags on services and the names of processors. In a controlled environ­ment like Galaxy, the modules are already classified in a taxonomy, which could be used to infer some of the proposed motifs. Vistrails, on the other hand, normally produces well documented modules in their workflows. Such documentation could be used to derive the motifs of the workflow. Our identification of Workflow-Oriented motifs also acts as a set of heuristics for automatically creating abstractions over workflows [39], like grouping stateful interac­tions on a service endpoint, detection of data preparation activities to highlight the real functionality of the workflow, detecting sub­groups of repeated data preparations steps (i.e., internal macros), etc.

Finally, another area for future work is to analyze how motifs could be used to manage workflow decay. By understanding the

functionality of the workflow steps, it should be easier to replace broken steps or fragments with alternative or updated services. In order to achieve this goal, a further exploration of our motifs in combination with the intrinsic and environmental characteristics of the workflows [17] will be required.

Appendix

This section details the occurrences of the motifs for all the workflows measured in the analysis. Tables A.1-A.3 provide details

Page 13: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

Table A.4 Distribution of the data preparation motifs among the workflows analyzed, grouped by workflow system.

Combine Filter Format transformation Group Input augmentation Output extraction Sort Split Data preparation

Wings Taverna Galaxy Vistrails

42 143

12 2

67 60 48 33

50 64 17 38

0 4 4 1

15 126

5 3

16 127

19 22

21 0 6 0

8 43

0 0

219 567 111 99

Table A.5 Distribution of the Data-Operation motifs among the workflows analyzed, grouped by workflow system.

Wings Taverna Galaxy Vistrails

Data

219 567 111 99

preparation Data

124 112 46

6

analysis Data cleaning

9 49

6 0

Data movement

0 92

0 0

Data retrieval

5 124

1 16

Data visualization

36 9 8

87

Tota

393 953 172 208

Total data operation

Table A.6 Distribution of the Workflow-Oriented motifs among the workflows analyzed, grouped by workflow system.

Atomic workflow Composite workflow Internal macro Human interaction Stateful invocation Workflow overload Total workflow motifs

Wings 48 Taverna 75 Galaxy 22 Vistrails 15

41 108

4 5

31 22 11 9

0 22

0 0

0 13 0 0

27 147 248

37 29

of the number of motifs per workflows grouped by domain, while Tables A.4-A.6 specify the occurrences of each motif grouped by the workflow system.

References

[1] Ewa Deelman, Dennis Gannon, Matthew Shields, Ian Taylor, Workflows and e-science: an overview of workflow system features and capabilities, Future Generation Computer Systems 25 (5) (2009) 528-540.

[2] Katherine Wolstencroft, Robert Haines, Donal Fellows, Alan Williams, David Withers, Stuart Owen, Stian Soiland-Reyes, Ian Dunlop, Aleksandra Nenadic, Paul Fisher, Jiten Bhagat, Khalid Belhajjame, Finn Bacall, Alex Hardisty, Abraham Nieva de la Hidalga, Maria P. Balcazar Vargas, Shoaib Sufi, Carole Goble, The taverna workflow suite: designing and executing workflows of web services on the desktop, web or in the cloud, Nucleic Acids Research (2013).

[3] Yolanda Gil, Varun Ratnakar, Jihie Kim, Pedro A. Gonzalez-Calero, Paul T Groth, Joshua Moody, Ewa Deelman, Wings: intelligent workflow-based design of computational experiments, IEEE Intelligent Systems 26 (1) (2011) 62-72.

[4] Jeremy Goecks, Anton Nekrutenko, James Taylor, Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computa­tional research in the life sciences, Genome Biology 11 (8) (2010) R86.

[5] Carlos E. Scheidegger, Huy T Vo, David Koop, Juliana Freire, Claudio T. Silva, Querying and re-using workflows with vistrails, in: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD'08, ACM, New York, NY, USA, 2008, pp. 1251-1254.

[6] Bertram Ludascher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, Jing Tao, Yang Zhao, Scientific workflow management and the kepler system, Concurrency and Computation: Practice and Experience 18 (10) (2006) 1039-1065.

[7] Thomas Fahringer, Radu Prodan, Rubing Duan,Jiirgen Hofer, Farrukh Nadeem, Francesco Nerieri, Jun QJn Stefan Podlipnig, Mumtaz Siddiqui, Hong-Linh Truong, Alex Villazon, Marek Wieczorek, ASKALON: a development and grid computing environment for scientific workflows, in: Ian J. Taylor, Ewa Deelman, Dennis B. Gannon, Matthew Shields (Eds.), Workflows for e-Science, Springer, 2007, pp. 450-471 (Chapter 27).

[8] Jia Zhang, Wei Tan, J. Alexander, I. Foster, R Madduri, Recommend-as-you-go: a novel approach supporting services-oriented scientific workflow reuse, in: Proceeding of IEEE International Conference on Services Computing, SCC, Washington, DC, July 2011, pp. 48-55.

[9] Antoon Goderis, Ulrike Sattler, Phillip W. Lord, Carole A. Goble, Seven bottlenecks to workflow reuse and repurposing, in: International Semantic Web Conference, Springer, 2005, pp. 323-337.

[10] David De Roure, Carole A. Goble, Robert Stevens, The design and realisation of the myexperiment virtual research environment for social sharing of workflows, Future Generation Computer Systems 25 (5) (2009) 561-567.

[11] Phillip Mates, Emanuele Santos, Juliana Freire, Claudio T. Silva, Crowdlabs: social analysis and visualization for the sciences, in: 23rd International Conference on Scientific and Statistical Database Management, SSDBM, Springer, 2011, pp. 555-564.

[12] Daniel Garijo, Pinar Alper, Khalid Belhajjame, Oscar Corcho, Yolanda Gil, Carole Goble, Common motifs in scientific workflows: an empirical analysis, in: 8th IEEE International Conference on eScience 2012, Chicago, IEEE Computer Society Press, USA, 2012.

[13] Wil M.P. van der Aalst, Arthur H.M. ter Hofstede, Bartek Kiepuszewski, Alistair P. Barros, Workflow patterns, Distributed and Parallel Databases 14(1) (2003) 5-51.

[14] Sara Migliorini, Mauro Gambini, Marcello La Rosa, Arthur H.M. ter Hof­stede, Pattern-based evaluation of scientific workflow management sys­tems. Technical Report, Queensland University of Technology, 2011, URL: http://eprints.qut.edu.au/39935/.

[15] Malcolm Atkinson, Rob Baxter, Paolo Besana, Michelle Galea, Mark Parsons, Peter Brezany, Oscar Corcho, Jano van Hemert, David Snelling (Eds.), THE DATA BONANZA—Improving Knowledge Discovery for Science, Engineering and Business, WILEY-INTERSCIENCE, A John Wiley & Sons, Inc. Publication, Edinburgh, UK, 2013.

[16] Lavanya Ramakrishnan, Beth Plale, A multi-dimensional classification model for scientific workflow characteristics, in: Proceedings of the 1st International Workshop on Workflow Approaches to New Data-centric Science, Wands'10, ACM, New York, NY, USA, 2010, pp. 4:1-4:12.

[17] Simon Ostermann, Radu Prodan, Thomas Fahringer, Ru Iosup, Dick Epema, On the characteristics of grid workflows, in: Proceedings of CoreGRID Integration Workshop 2008, Hersonisson, Crete, 2008, pp. 431-442.

[18] Sarah Cohen-Boulakia, Jiuqiang Chen, Paolo Missier, Carole Goble, Alan R Williams, Christine Froidevaux, Distilling structure in taverna scientific workflows: a refactoring approach, BMC Bioinformatics (2013) to appear.

[19] R.D. Stevens, CA. Goble, P. Baker, A. Brass, A classification of tasks in bioinformatics, Bioinformatics 17 (2) (2001) 180-188.

[20] Yolanda Gil, Paul Groth, Varun Ratnakar, Christian Fritz, Expressive reusable workflow templates, in: Proceedings of the Fifth IEEE International Conference on e-Science, Oxford, UK, 2009.

[21] Ingo Wassink, Paul E Van Der Vet, Katy Wolstencroft, Pieter B T Neerincx, Marco Roos, Han Rauwerda, Timo M Breit, Analysing scientific workflows: why workflows not only connect web services, Published in 2009 World Conference on Services -1 . , 2009, pp. 314-321 (5).

[22] Johannes Starlinger, Sarah Cohen-Boulakia, Ulf Leser, (Re)use in public scientific workflow repositories, in: Anastasia Ailamaki, Shawn Bowers (Eds.), Scientific and Statistical Database Management, in: Lecture Notes in Computer Science, vol. 7338, Springer, Berlin, Heidelberg, 2012, pp. 361-378.

[23] Asuncion Gomez Perez, Richard Benjamins, Applications of ontologies and problem-solving methods, AI Magazine 20 (1) (1999).

[24] Jose Manuel Gomez-Perez, Michael Erdmann, Mark Greaves, Oscar Corcho, Richard Benjamins, A framework and computer system for knowledge-level acquisition, representation, and reasoning with process knowledge, International Journal of Human-Computer Studies 68 (10) (2010).

[25] E. Deelman, G. Singh, M. Su, J. Blythe, Y Gil, C. Kesselman, J. Kim, G. Mehta, K Vahi, G.B. Berriman,J. Good, A. Laity,J.C.Jacob, D.S. Katz, Pegasus: a framework for mapping complex scientific workflows onto distributed systems, Scientific Programming 13 (3) (2005).

Page 14: Common motifs in scientific workflows: An empirical analysisoa.upm.es/21854/1/1-s2_0-S0167739X13001970-main.pdf · an analysis of 177 workflows from Wings and Taverna. The new contributions

[26] Chris A. Mattmann, Daniel J. Crichton, Nenad Medvidovic, Steve Hughes, A software architecture-based framework for highly distributed and data intensive scientific applications, in: Proceedings of the 28th International Conference on Software Engineering, ICSE'06, ACM, New York, NY, USA, 2006, pp. 721-730.

[27] B. Giardine, et al., Galaxy: a platform for interactive large-scale genome analysis, Genome Research 15 (10) (2005) 1451-1455.

[28] Daniel Blankenberg, James Taylor, Ian Schenck, Jianbin He, Yi Zhang, Matthew Ghent, Narayanan Veeraraghavan, Istvan Albert, Webb Miller, Kateryna D Makova, et al., A framework for collaborative analysis of encode data: making large-scale analyses biologist-friendly, Genome Research 17 (6) (2007) 960-964.

[29] Steven P. Callahan, Juliana Freire, Emanuele Santos, Carlos E. Scheidegger, Cludio T. Silva, Huy T. Vo, Vistrails: visualization meets data management, in: ACM SIGMOD, ACM Press, 2006, pp. 745-747.

[30] Daniel Garijo, Yolanda Gil, A new approach for publishing workflows: abstractions, standards, and linked data, in: Proceedings of the 6th Workshop on Workflows in Support of Large-Scale Science, ACM, Seattle, 2011, pp. 47 -56.

[31] Duncan Hull, Robert Stevens, Phillip Lord, Chris Wroe, Carole Goble, Treating shimantic web syndrome with ontologies, in: AKT Workshop on Semantic Web Services, 2004.

[32] Kevin Davies, Democratizing informatics for the long tail scientist, Bio-IT World Magazine (2011).

[33] Khalid Belhajjame, Oscar Corcho, Daniel Garijo, Jun Zhao, Paolo Missier, David Newman, Raul Palma, Sean Bechhofer, Esteban Garcia-Cuesta, Jose-Manuel Gomez-Perez, Graham Klyne, Kevin Page, Marco Roos, Jose Enrique Ruiz, Stian Soiland-Reyes, Lourdes Verdes-Montenegro, David De Roure, Carole Goble, Workflow-centric research objects: first class citizens in scholarly discourse, in: Proceedings of Sepublica2012, 2012, pp. 1-12.

[34] Daniel Garijo, Yolanda Gil, Augmenting prov with plans in p-plan: scientific processes as linked data, in: Second International Workshop on Linked Science: Tackling Big Data (LISC), Held in Conjunction with the International Semantic Web Conference, ISWC, Boston, MA, 2012.

[35] Paolo Missier, Saumen Dey, Khalid Belhajjame, Victor Cuevas, Bertram Ludaescher, D-PROV: extending the PROV provenance model with workflow structure, in: Procs. TAPP'13, Lombard, IL, 2013.

[36] Pinar Alper, Khalid Belhajjame, Carole Goble, Pinar Karagoz, Small is beautiful: Summarizing scientific workflows using semantic annotations, in: Proceedings of the IEEE 2nd International Congress on Big Data, BigData 2013, Santa Clara, CA, USA, June 2013.

[37] Arun Ramakrishnan, Gurmeet Singh, Henan Zhao, et al., Scheduling data-intensive workflows onto storage-constrained distributed resources, in: CC-GRID, IEEE Computer Society, 2007, pp. 401-409.

[38] Khalid Belhajjame, Jun Zhao, Daniel Garijo, Aleix Garrido, Stian Soiland-Reyes, Pinar Alper, Oscar Corcho, A workflow prov-corpus based on taverna and wings, in: Proceedings of the Joint EDBT/ICDT 2013 Workshops, EDBT'13, ACM, New York, NY, USA, 2013, pp. 331-332.

[39] Daniel Garijo, Oscar Corcho, Yolanda Gil, Detecting common scientific workflow fragments using templates and execution provenance, in: The Seventh International Conference on Knowledge Capture, K-CAP, Banff, Alberta, Canada, 2013.