www.ci.anl.govwww.ci.uchicago.edu
Creating and Sharing Re-usable Workflows in Cardiovascular Research: Lessons learned using Taverna
Ravi Madduri
University of Chicago
Argonne National Laboratory
About me
• Research Fellow at the Computation Institute, University of Chicago
• Lead architect for Workflow technologies in the caBIG project
• Workflow Working Group Chair and a key person in the BIRN project
• Interested in Informatics, Applications of High throughput data transfer, computing in Biomedical informatics
And..
Agenda
• Introduction to Service Oriented Science (SoS)
• Introduction to caBIG as an example of SoS
• Introduce caGrid as an enabler of SoS vision
• Introduce Workflow concepts
• Talk about our implementation using Taverna
• Show a few Taverna workflows including the AutoQRS workflow from CVRG
• Lessons learned and future directions.
Service-Oriented Science
People create services (data, code, instr.) …which I discover (& decide whether to use) …& compose to create a new function ... & then publish as a new service.
I find “someone else” to host services, so I don’t have to become an expert in operating services & computers!
I hope that this “someone else” can manage security, reliability, scalability, …
“Service-Oriented Science”, Science, 2005
caBIG Goal and Vision
caBIG is a virtual web of interconnected data, individuals and organizations that redefines how research is conducted, care is provided, and patients/participants interact with the biomedical enterprise.
• Connect the cancer research community through a shareable, interoperable infrastructure
• Deploy and extend standard rules and a common language to more easily share information
• Build or adapt tools for collecting, analyzing, integrating and disseminating information associated with cancer research and care
caGrid
caBIG function dimensions
Clinical Data and Trials Management
Biospecimen Management
In Vivo Imaging
Molecular Characterization
What is caGrid?
• Biomedical applications that share data all have common needs for syntactic and semantic interoperability
• caGrid is a software toolkit aimed at software developers creating Grid applications
caGrid provides
• Metadata services that add semantic information to all Grid services
• The GAARDS toolkit, a standard security platform
• Introduce: the ‘Eclipse’ for services development
• Index Service: A service registry for advertisement and discovery of capabilities
caGrid: nuts and bolts
A scientific workflow
• precisely defines a multi-step procedure, to seamlessly integrate and streamline local and remote heterogeneous computational and data resources to perform in silico scientific exploration.
Workflow Requirements
Service discovery
Data access
Service interaction
Security enforcement
Knowledge sharing
Overview of caGrid Workflow
[Figure labels: caGrid (data, instruments, computation resource; virtualization, security, connectivity); Cancer Data Standards Repository; discovery, composition, orchestration, analysis; community (reuse, generate)]
• Workflow as consumer: easily reuse services for complex experiments.
• Workflow as contributor: workflow as “best practice” wrapped as services.
• Workflow providing RoI for SOA
caGrid Workflow Suite
• (1) Service discovery
• (2) Data access
• (3) Service interaction (service invocation)
• (4) Security enforcement
• (5) Knowledge sharing
[Figure labels: Taverna workbench, caFlow; Index, Metadata; data services, analytical services; security services (authentication, credential delegation)]
The caBIG Workflow System
[Figure labels: caGrid; Cancer Data Standards Repository; discovery, composition, execution, reuse; community (reuse, generate)]
• Service discovery based on cancer research metadata.
• Data-flow modeling flavor
• caGrid activity
• State management (WSRF)
• Security (GSI)
• Implicit iteration: handle parallel execution
• WSRF and GSI enforcement
• A “Facebook” for caGrid workflows
• Workflow Execution Service
• Workflows in caGrid Portal
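The implicit iteration mentioned on this slide can be illustrated with a small sketch. In Taverna, when a port that expects a single value receives a list, the service is invoked once per element; the calls are independent, so they can run in parallel. The names `invoke_service` and `implicit_iterate` are hypothetical stand-ins, and this is not Taverna's engine code.

```python
from concurrent.futures import ThreadPoolExecutor


def invoke_service(experiment_id):
    """Hypothetical stand-in for one single-value service call."""
    return f"result-for-{experiment_id}"


def implicit_iterate(service, value, max_workers=4):
    """Call `service` once for a single value, or once per element of a list.

    A list arriving at a single-value port fans out into independent
    calls whose results are collected in input order.
    """
    if not isinstance(value, list):
        return service(value)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the inputs in its results.
        return list(pool.map(service, value))


single = implicit_iterate(invoke_service, "E-1")
many = implicit_iterate(invoke_service, ["E-1", "E-2", "E-3"])
```

The workflow author wires the ports as if one value flows through; the fan-out happens without an explicit loop in the workflow.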
Semantic Service Discovery
• Semantic search: searches Index Service for registered caGrid services matching various search criteria:
– Service name, inputs, outputs, research center, class names, concept codes, etc.
Semantic Service Discovery
Service metadata
• Types of query
– String based.
– Property based.
– Semantic based.
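The three query types can be contrasted with a minimal sketch over a toy in-memory registry. Everything here is an assumption made for illustration: the `ServiceEntry` fields, the service names, the research centers, and the concept codes are invented placeholders, not the caGrid metadata schema or the Index Service API.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceEntry:
    """Toy metadata record; field names are illustrative only."""
    name: str
    research_center: str
    concept_codes: set = field(default_factory=set)


# Invented entries; the concept codes are placeholders, not real codes.
REGISTRY = [
    ServiceEntry("MicroarrayPreprocess", "Center A", {"C0001"}),
    ServiceEntry("LymphomaPrediction", "Center B", {"C0002"}),
    ServiceEntry("ImageAnnotation", "Center A", {"C0003"}),
]


def string_query(text):
    """String based: case-insensitive substring match on the service name."""
    return [s.name for s in REGISTRY if text.lower() in s.name.lower()]


def property_query(prop, value):
    """Property based: exact match on one named metadata property."""
    return [s.name for s in REGISTRY if getattr(s, prop) == value]


def semantic_query(concept_code):
    """Semantic based: match services annotated with a concept code."""
    return [s.name for s in REGISTRY if concept_code in s.concept_codes]
```

For example, `string_query("lymphoma")` matches by name, `property_query("research_center", "Center A")` matches by a metadata field, and `semantic_query("C0002")` matches by annotation regardless of how the service is named.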
caBIG services palette
• As a result of semantic search or direct adding
– caBIG services appear in Taverna’s Service Panel
– Ready to be dragged and dropped into caGrid workflows
Data access: CQL Builder
Service interaction: managing state
Security enforcement
• Authentication
– Ability to invoke services secured by Grid Security Infrastructure (GSI)
– Integrated caGrid Security framework (GAARDS) with Taverna’s Credential Manager
– Transport Level Security
• Authorization
– Performed on the service side, based on the user’s credentials
• Credential Delegation Service Integration
Secure Grid services
• Taverna can invoke secure Grid services that require the user to log in to caGrid
• Taverna interacts with caGrid’s GAARDS infrastructure to obtain the user’s proxy:
– Authenticate the user with the user’s affiliated Authentication Service
– Obtain the user’s proxy from Dorian Service
– Default proxy lifetime: 12 hours
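The two-step login can be sketched as follows. This is a minimal illustration under stated assumptions, not the GAARDS client API: `StubAuthenticationService`, `StubDorian`, `Proxy`, and `log_in` are hypothetical names, and the "assertion" is a plain dict standing in for whatever the real Authentication Service returns. Only the order of the steps and the 12-hour default come from the slide.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

DEFAULT_PROXY_LIFETIME = timedelta(hours=12)  # default stated on the slide


@dataclass
class Proxy:
    subject: str
    expires_at: datetime


class StubAuthenticationService:
    """Hypothetical stand-in: checks a password and returns an assertion."""

    def __init__(self, users):
        self._users = users

    def authenticate(self, username, password):
        if self._users.get(username) != password:
            raise PermissionError("authentication failed")
        return {"user": username}  # placeholder for a signed assertion


class StubDorian:
    """Hypothetical stand-in: exchanges an assertion for a proxy."""

    def issue_proxy(self, assertion, now, lifetime=DEFAULT_PROXY_LIFETIME):
        return Proxy(subject=assertion["user"], expires_at=now + lifetime)


def log_in(auth, dorian, username, password, now=None):
    """Step 1: authenticate. Step 2: obtain a proxy from Dorian."""
    now = now or datetime.now(timezone.utc)
    assertion = auth.authenticate(username, password)
    return dorian.issue_proxy(assertion, now)
```

A failed password check stops the flow before Dorian is contacted, so no proxy is issued for an unauthenticated user.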
Using secure caGrid services
• Involves:
1. Discovering a secure caGrid service from Taverna
2. Logging onto selected caGrid to obtain a proxy certificate
3. Saving and managing caGrid proxies and usernames and passwords
Configuring secure services (1/2)
• Authentication Service and Dorian Service URLs are required in order to obtain the user’s proxy
• Can be configured globally for all services from the same caGrid (in preferences)
• Can be configured individually for a particular caGrid service (overrides configuration from preferences)
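The precedence between the two configuration levels can be sketched like this. The `SecurityConfig` type, the `resolve_config` function, and the example.org URLs are all hypothetical; only the rule that a per-service setting overrides the preferences comes from the slide.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class SecurityConfig:
    authentication_service_url: str
    dorian_service_url: str


def resolve_config(
    service_url: str,
    global_prefs: Optional[SecurityConfig],
    per_service: Dict[str, SecurityConfig],
) -> SecurityConfig:
    """Return the per-service configuration if present, else the global one."""
    override = per_service.get(service_url)
    if override is not None:
        return override
    if global_prefs is None:
        raise LookupError(f"no security configuration for {service_url}")
    return global_prefs


# Hypothetical URLs, for illustration only.
prefs = SecurityConfig(
    "https://auth.example.org/auth", "https://dorian.example.org/dorian"
)
special = SecurityConfig(
    "https://auth.other.example.org/auth", "https://dorian.other.example.org/dorian"
)
overrides = {"https://svc.example.org/secure": special}
```

With these values, the service at `https://svc.example.org/secure` resolves to `special`, and any other service from the same caGrid falls back to `prefs`.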
Configuring secure services (2/2)
• View the secure service’s details
• Configure the service’s security properties
Logging onto caGrid
• The user is prompted for their caGrid username and password when any secure service is invoked from a workflow for the first time
Credential management
• Taverna obtains proxy for user from Dorian Service using user’s caGrid username and password
• Proxies are saved and managed by Credential Manager
• caGrid username and password can also be remembered
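Why saving proxies matters can be shown with a toy cache: the proxy is fetched once and reused until it expires, so the user is not prompted on every service call. `ProxyCache` and its `fetch_proxy` callable are hypothetical and do not reflect the interface of Taverna's Credential Manager; the 12-hour lifetime is assumed as the default.

```python
from datetime import datetime, timedelta, timezone


class ProxyCache:
    """Toy credential cache: fetch a proxy once, reuse it until it expires."""

    def __init__(self, fetch_proxy, lifetime=timedelta(hours=12)):
        self._fetch_proxy = fetch_proxy  # hypothetical callable returning a proxy
        self._lifetime = lifetime
        self._proxy = None
        self._expires_at = None

    def get(self, now=None):
        now = now or datetime.now(timezone.utc)
        # Fetch on first use, or again once the cached proxy has expired.
        if self._proxy is None or now >= self._expires_at:
            self._proxy = self._fetch_proxy()
            self._expires_at = now + self._lifetime
        return self._proxy


fetch_calls = []


def fake_fetch():
    """Stand-in for the real login; records how often it is called."""
    fetch_calls.append(1)
    return f"proxy-{len(fetch_calls)}"


cache = ProxyCache(fake_fetch)
```

Calling `cache.get()` repeatedly within the lifetime returns the same proxy; the first call after expiry triggers a new fetch.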
Workflow execution service
Taverna Workflow Service wraps the Taverna execution engine into a WS-Resource and exposes operations such as createResource, startWorkflow, getStatus, and getOutput for user submitted workflows.
[Figure labels: Taverna Workbench, Workflow Portlet, Client API; Workflow Service (createResource, startWorkflow, getStatus, getOutput); Stateful Resources (Resource Properties), EPR; Taverna Engine; Data Services, Analytical Services, caGrid & Other Services]
Workflow execution service
• Taverna Workflow Service
– Provides stateful resources that execute the workflows.
– Supports caGrid security architecture (GSI Security).
– Allows programmatic submission of workflows.
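Programmatic submission follows a create, start, poll, fetch pattern. The sketch below runs against an in-memory stub: only the four operation names (createResource, startWorkflow, getStatus, getOutput) come from the slides. The status strings, the string used in place of an EPR, and the `run_workflow` helper are assumptions, and the real client API may differ.

```python
import itertools


class StubWorkflowService:
    """In-memory stand-in for the workflow service (illustration only)."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._resources = {}

    def createResource(self, workflow_definition):
        epr = f"resource-{next(self._ids)}"  # placeholder for a real EPR
        self._resources[epr] = {
            "workflow": workflow_definition,
            "status": "Pending",
            "polls": 0,
            "inputs": None,
            "output": None,
        }
        return epr

    def startWorkflow(self, epr, inputs):
        res = self._resources[epr]
        res["status"] = "Active"
        res["inputs"] = inputs

    def getStatus(self, epr):
        res = self._resources[epr]
        if res["status"] == "Active":
            res["polls"] += 1
            if res["polls"] >= 2:  # pretend the run finishes on the second poll
                res["status"] = "Completed"
                res["output"] = {"echo": res["inputs"]}
        return res["status"]

    def getOutput(self, epr):
        res = self._resources[epr]
        if res["status"] != "Completed":
            raise RuntimeError("workflow has not finished")
        return res["output"]


def run_workflow(service, workflow_definition, inputs, max_polls=10):
    """Create a resource, start the workflow, poll, then fetch the output."""
    epr = service.createResource(workflow_definition)
    service.startWorkflow(epr, inputs)
    for _ in range(max_polls):
        if service.getStatus(epr) == "Completed":
            return service.getOutput(epr)
    raise TimeoutError("workflow did not finish in time")
```

Because each submission gets its own resource, a client can keep the returned reference and check status or collect output later, rather than blocking on a single long call. A real client would also wait between polls.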
Access Taverna workflow via caGrid portal
Taverna Workflow Portlet is deployed in the caGrid Portal on the training Grid:
URL : http://portal-demo.training.cagrid.org/web/guest/tools/taverna-workflow
• The Portlet currently lists a few workflows with their descriptions that can be browsed from the above URL
• Users can select a workflow they are interested in running.
View : 1
Access Taverna workflow via caGrid portal
URL : http://portal-demo.training.cagrid.org/web/guest/tools/taverna-workflow
• Based on the number of input ports in the workflow, the portlet prompts the users to enter the input values in the textbox.
• For example, the Lymphoma workflow takes only one input in the form of an Experiment ID that identifies the experiment that caArray uses for data collection.
• Hit submit after entering the data.
View : 2
Access Taverna workflow via caGrid portal
URL : http://portal-demo.training.cagrid.org/web/guest/tools/taverna-workflow
• The portlet stores the user submitted workflows in the current session of the portal.
• Users can View all the Active and Completed Workflows in the session.
• Clicking the Output Button shows the output of the workflow.
• The portlet provides workflow-specific view resolvers to render the outputs. For example, the Lymphoma workflow currently displays its output in an HTML table.
Views : 3, 4, & 5
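The idea of workflow-specific view resolvers can be sketched as a lookup from workflow name to rendering function, with a plain fallback. The registry, the function names, and the row format are hypothetical; the portlet's actual implementation is not shown in these slides.

```python
from html import escape


def render_html_table(rows):
    """Render a list of dicts as an HTML table (first row's keys are headers)."""
    if not rows:
        return "<table></table>"
    headers = list(rows[0])
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{escape(str(row[h]))}</td>" for h in headers) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def render_plain(output):
    """Fallback view: show the raw output as preformatted text."""
    return f"<pre>{escape(str(output))}</pre>"


# Hypothetical registry mapping a workflow name to its view resolver.
VIEW_RESOLVERS = {"lymphoma": render_html_table}


def render_output(workflow_name, output):
    resolver = VIEW_RESOLVERS.get(workflow_name, render_plain)
    return resolver(output)
```

Adding a new workflow then means registering one more rendering function, without touching the code that submits or tracks workflows.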
Knowledge Sharing
• Search ‘cabig’ in myExperiment, or
• Type http://www.myexperiment.org/search?type=workflows&query=cabig
• Type http://tinyurl.com/cabig-workflow
Discovery using myExperiment
Lymphoma Prediction Workflow
[Figure: microarray from tumor tissue → microarray preprocessing → lymphoma prediction]
Lymphoma type prediction
Acknowledgement: Juli Klemm, Xiaopeng Bian, Rashmi Srinivasa (NCI); Jared Nedzel (MIT)
AutoQRS Analysis Workflow
[Figure labels: WFDB binary and Patient ID; WFDB data service; Store WFDB; Retrieve WFDB Patient Record; JSDL service; Invoke Processing; AutoQRS Analytical Service; Analysis Execution Record; AutoQRS Output Data Service; AutoQRS XML Results]
The Taverna workflow
The result in MS Excel
Accomplishments
• Lymphoma workflow
– Among the top 20 most viewed/downloaded Workflows in myExperiment
– This is more impressive given that this workflow was uploaded much later than the other workflows
• Our BMC-Bioinformatics Article on “caGrid Workflow Toolkit: A Taverna based workflow tool for cancer Grid” achieved “Highly Accessed” relative to its age
• We are part of the CVRG Project that recently got renewed
Lessons Learned
• Lower the barriers to entry for sharing data and analytics
• Software is surprisingly hard to use for end users – more so if the benefit is not all too clear
• Return on Investment of a SOA is in creating reusable workflows (LEGO blocks)
• Workflows are only as good as the services we create
• Traditional SDLC does not always work in the favor of the end users
• 80-20 and KISS
Goals of Workflow Project in CVRG
• Deploy existing technology on the CVRG that can be used to store and execute workflows generated locally using the Taverna workbench
• Develop new technology that allows non-expert users to graphically compose and execute workflows via a web-interface.
• Extend the Taverna Engine and add support for invoking REST-style services so that users can annotate workflow inputs and outputs using ontology terms from NCBO Bioportal and other ontology repositories
• Develop specifications describing how workflows should be designed, validated, and documented, and support user development of workflows.
• Extend the technology so that workflows can be executed in a cloud-computing environment
Suggested Direction
• Hosted Workflow Solution
– SaaS workflow tools
• Globus Online
• Galaxy
Acknowledgements
• Univ. Chicago / ANL
– Ian Foster
– Dinanath Sulakhe
– Bo Liu
• Univ. Manchester, UK
– Carole Goble
– Stian Soiland-Reyes
– Alexandra Nenadic
• Inventrio
– Shannon Hastings
– Stephen Langella
– Scott Oster
• Other colleagues from Ohio State University, National Cancer Institute, JHU …
Journal papers & book chapters
• Composition as a Service. IEEE Internet Computing. 2010
• A Comparison of Using Taverna and BPEL in Building Scientific Workflows: the case of caGrid. CCPE. 2010.
• Data-driven Service Composition in Building SOA Solutions: A Petri Net Approach. IEEE T-ASE, 2010
• Scientific workflows that enable Web-scale collaboration: combining the power of Taverna and caGrid. IEEE Internet Computing. 2008
• Workflow in a Service Oriented Cyberinfrastructure Environment. in: Junwei Cao (Ed.). Cyberinfrastructure Technologies and Applications. Nova Science Publishers, 2008. (book chapter)
Conference papers
• Scientific workflows as services in caGrid: a Taverna and gRAVI approach. ICWS 2009
• Wrap Scientific Applications as WSRF Grid Services using gRAVI. ICWS 2009
• Orchestrating caGrid Services in Taverna. ICWS 2008
• Building Scientific Workflow with Taverna and BPEL: a Comparative Study in caGrid. WESOA 2008
• Build Grid Enabled Scientific Workflows using gRAVI and Taverna. SWBES 2008
Contact information
• Ravi Madduri
– [email protected]
• Computation Institute, Univ. Chicago
– http://www.ci.uchicago.edu/