Using Provenance to Improve Workflow Design
Frederico Tosta, Leonardo Murta, Claudia Werner, Marta Mattoso
{ftoliveira, murta, werner, marta}@cos.ufrj.br
COPPE – Federal University of Rio de Janeiro - Brazil
Summary
• Motivation
• Introduction & Background
• Goal
• Approach & Implementation
• Conclusion
COPPE/UFRJ
Motivation
Pieces of workflows that occurred in the past may occur again in the future.
Motivation
• The number of services and bioinformatics operations is growing: Taverna had over 3500 services (2007), and VisTrails had over 1200 modules (2008).
[Figure: workflows (WF) and services]
Motivation
How can we find the pieces or services that are useful during the design of a new workflow in an automatic and systematic way?
Software Reuse
• Is the process of creating software systems from existing software [Krueger, 1992].
[Figure: Software Reuse, labeled with Quality, Reliability, Reduced Cost and Productivity]
Recommendation Systems
• E-commerce: apply data mining techniques to help users find the items they would like to purchase.
Domain                  Concepts
E-commerce              Customer    Product*            Cart                     Preference
Scientific Experiment   Scientist   Component / Actor   Workflow (Goble, 2007)   Context
Table: E-commerce concepts mapped into scientific experiment concepts
* what is recommended by e-commerce sites
Goal
• Propose a proactive recommendation service that aims at suggesting frequent combinations of scientific programs for reuse.
Approach
[Figure: Design for reuse and recommendation. Elements: Design, Workflow specification, Provenance, DB]
Approach
[Figure: Design with reuse and recommendation. Elements: Design, Workflow specification, Provenance, DB, Proactive Recommendation]
Implementation
• Populating the database:
  VisTrails workflows:
  - Parse provenance XML files to extract the relations.
  MySQL database:
  - The relations are mapped into a database.
  - Each relation contains the modules and how they are connected.
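The two population steps (parse the provenance XML, then map the relations into a database) can be sketched as follows. This is only an illustration under stated assumptions: the XML layout, the element and attribute names, the table name `relation`, and both helper functions are hypothetical and do not reproduce the real VisTrails schema. An in-memory SQLite database also stands in for MySQL so that the sketch is self-contained.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical, simplified provenance file. The real VisTrails XML schema differs.
SAMPLE_XML = """
<workflow>
  <module id="1" name="HmmBuild"/>
  <module id="2" name="HmmCalibrate"/>
  <connection>
    <source moduleId="1" port="StdOut"/>
    <destination moduleId="2" port="HmmPath"/>
  </connection>
</workflow>
"""


def extract_relations(xml_text):
    """Return (source, destination, source_port, dest_port) tuples."""
    root = ET.fromstring(xml_text.strip())
    # Map module ids to module names so connections can be resolved.
    names = {m.get("id"): m.get("name") for m in root.iter("module")}
    relations = []
    for conn in root.iter("connection"):
        src = conn.find("source")
        dst = conn.find("destination")
        relations.append((
            names[src.get("moduleId")],
            names[dst.get("moduleId")],
            src.get("port"),
            dst.get("port"),
        ))
    return relations


def store_relations(db, relations):
    """Map the extracted relations into a single table."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS relation ("
        "source TEXT, destination TEXT, source_port TEXT, dest_port TEXT)"
    )
    db.executemany("INSERT INTO relation VALUES (?, ?, ?, ?)", relations)
    db.commit()


# SQLite is used here only as a stand-in for the MySQL database.
db = sqlite3.connect(":memory:")
store_relations(db, extract_relations(SAMPLE_XML))
```

Each stored row records the two modules and the ports through which they are connected, matching the statement that a relation contains the modules and how they are connected.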
Implementation
VisTrails workflow design with recommendation
Source     Destination    Source Port      Dest Port
HmmBuild   HmmCalibrate   DestinationDir   SourceDir
HmmBuild   Cat            DestinationDir   Dir
HmmBuild   HmmCalibrate   DestinationDir   HmmPath
HmmBuild   HmmCalibrate   StdOut           HmmPath
HmmBuild   HmmCalibrate   StdOut           HmmPath
Ports 1 and 2 are the output ports DestinationDir and StdOut, respectively. Ports 3, 4 and 5 are the input ports SourceDir, HmmPath and Dir, respectively.
• Recommendation metric: From the example, we can infer that port StdOut of HmmBuild has been connected to port HmmPath of HmmCalibrate in 40% of previously designed workflows (2 of the 5 recorded connections).
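The 40% figure can be reproduced from the five example relations with a short frequency count. The sketch below is one plausible reading of the metric, not the prototype's actual code. It assumes each row comes from a distinct previously designed workflow and that the denominator is the number of recorded connections leaving the source module. The function name `recommend` is hypothetical.

```python
from collections import Counter

# The five relations from the example:
# (source, destination, source_port, dest_port)
RELATIONS = [
    ("HmmBuild", "HmmCalibrate", "DestinationDir", "SourceDir"),
    ("HmmBuild", "Cat", "DestinationDir", "Dir"),
    ("HmmBuild", "HmmCalibrate", "DestinationDir", "HmmPath"),
    ("HmmBuild", "HmmCalibrate", "StdOut", "HmmPath"),
    ("HmmBuild", "HmmCalibrate", "StdOut", "HmmPath"),
]


def recommend(relations, source):
    """Rank the connections leaving `source` by relative frequency.

    Assumption: the frequency is the count of an identical connection
    divided by all recorded connections that start at `source`.
    """
    outgoing = [r for r in relations if r[0] == source]
    if not outgoing:
        return []
    total = len(outgoing)
    counts = Counter(outgoing)
    # most_common() sorts by descending count, so the top suggestion is first.
    return [(rel, n / total) for rel, n in counts.most_common()]


ranking = recommend(RELATIONS, "HmmBuild")
top_relation, top_score = ranking[0]
print(top_relation, top_score)
```

With the example data, the top-ranked suggestion is the StdOut to HmmPath connection between HmmBuild and HmmCalibrate, with a score of 0.4. Each of the other three connections scores 0.2.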
Implementation
[Figure: VisTrails workflow design with recommendation]
Conclusion
• We expect that this approach may help to propagate the benefits of software reuse to the context of scientific workflows.
• Reduce the time to design workflows.
• Increase the quality of workflows designed.
Conclusion
• Limitations: The current version of our prototype recommends only a subsequent component, based on previously used connections.
• Future work:
  - Improve the approach by investigating the whole path when recommending a component.
  - Specify a context for each workflow.
  - Apply a weight to each relation based on workflow usage.