Karma: Provenance Collection Framework for Data-driven Workflows
Yogesh Simmhan, Microsoft Research
Beth Plale, Dennis Gannon, Ai Zhang, Girish Subramanian, Abhijit Borude, et al., Indiana University
Putting the ‘e’ in e-Science
Many scientific domains are moving to in silico experiments
◦ Earth Sciences, Life Sciences, Astronomy
Common requirements
◦ Complex & Dynamic Systems
◦ Adaptive Resources
◦ Data Deluge
◦ Need for Collaboration
Cyberinfrastructure to support these needs
◦ Massively Parallel Systems
◦ High Bandwidth Computer Networks
◦ Petascale Data Archives
Grid Middleware provides the glue to tie these together using a Service Oriented Architecture
Workflows as Experiments
Data-driven applications designed as workflows
Data flows across applications as it is transformed, fused and used, generating derived data
Control flow determines the path to execute, but data flow determines data movement and dependency
Manually keeping track of the input & derived data of experiments is challenging given the volume of data and the complexity of the applications
Data Management Challenges
Complex, dynamic data-processing pipelines
Remote execution on Grid resources
How was a particular dataset created?
Collaboratory environments with shared resources
Large search space & missing metadata
How good is a given dataset for one’s application?
Data Provenance
Metadata that describes the causality of an event
◦ Along with context to interpret it
What, when, where, who, how, …
We consider provenance for
◦ Workflow execution
◦ Service invocations
◦ Data products
Workflow & Service Provenance
◦ Describes execution of a workflow & invocation of service
Data Provenance
◦ Describes usage and generation of data products
Provenance /ˈprɒv ə nəns, -ˌnɑns/ The history or pedigree of a work of art, manuscript, etc. A record of the ultimate derivation and passage of an item through its various owners. Source: The Oxford English Dictionary
Benefits
What if the experiment fails?
◦ Did the workflow run correctly? Completely?
◦ Was the correct data/service/parameter used?
◦ Verification, Validation
Can my peer run the experiment & get the same result?
◦ Repeatability
Can I use the results in my publication?
◦ Attribution, Copyright
Can I trust the results of prediction?
◦ Data Quality
How much did it cost? How much will it cost?
◦ Resource Usage & Prediction
[7/43] [2007-08-16]
LEAD Science Gateway Architecture
[Figure: layered architecture diagram. User’s Grid Desktop; Grid Portal Server; Gateway Services (Proxy Certificate Server (Vault), Events & Messaging, Resource Broker, Community & User Metadata Catalog, Workflow Engine, Resource Registry, Application Deployment); Core Grid Services (Execution Management, Information Services, Self Management, Data Services, Resource Management, Security Services); Resource Virtualization (OGSA); Compute Resources, Data Resources, Instruments & Sensors]
What is Karma
Provenance Framework: A standalone framework to collect data provenance for adaptive workflows, with low overhead and a lightweight schema able to answer complex queries
Data Provenance is
◦ a form of metadata
◦ to track derivation history of data
◦ created by a workflow run
◦ executing across organizations (space)
◦ over a period of time
Data Usage: Move forward in time
Workflow trace: Inverse view from the actors
A Typical e-Science Experiment
Weather forecast using WRF in LEAD
[Figure: workflow with Pre-Processing, Assimilation, Forecast, and Visualization stages]
Workflows
Abstract Workflow Model
Temporal & Spatial composition
◦ Data Flow vs. Invocation Flow
Central vs. Distributed Orchestration
Assumption: Directed Graph of Service Nodes & Data Edges
◦ Data Driven Applications
Hierarchical Composition: Workflows are a form of Service
Workflow definition not required
Standalone, independent of Workflow System
[Figure legend: Provides Port, Uses Port, Data Flow]
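The abstract model on this slide (service nodes joined by data edges, with a workflow itself usable as a service) can be sketched in a few lines of Python. This is only an illustration under assumptions: the class and field names (`Service`, `Workflow`, `consumes`, `produces`, `data_edges`) are hypothetical and are not part of Karma.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """A service node that consumes and produces named data products."""
    name: str
    consumes: list
    produces: list


@dataclass
class Workflow:
    """A directed graph of nodes joined by data edges.

    It exposes the same consumes/produces view as Service, so a workflow
    can be placed inside another workflow (hierarchical composition).
    """
    name: str
    nodes: list = field(default_factory=list)

    @property
    def consumes(self):
        # External inputs: consumed by some node, produced by none.
        produced = {d for n in self.nodes for d in n.produces}
        inputs = []
        for n in self.nodes:
            for d in n.consumes:
                if d not in produced and d not in inputs:
                    inputs.append(d)
        return inputs

    @property
    def produces(self):
        # External outputs: produced by some node, consumed by none.
        consumed = {d for n in self.nodes for d in n.consumes}
        return [d for n in self.nodes for d in n.produces if d not in consumed]

    def data_edges(self):
        """Return (producer, data, consumer) triples implied by data flow."""
        return [
            (p.name, d, c.name)
            for p in self.nodes
            for d in p.produces
            for c in self.nodes
            if d in c.consumes
        ]


s1 = Service("S1", consumes=["D1"], produces=["D2"])
s2 = Service("S2", consumes=["D2"], produces=["D3"])
wf = Workflow("WF", nodes=[s1, s2])
outer = Workflow("Outer", nodes=[wf])  # a workflow used as a service node
```

Note that no explicit workflow definition is needed to recover the edges here: they follow from which data each node consumed and produced.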
Workflows
Simple & Complex Workflow Models
[Figure: simple model: a Workflow Engine runs workflow WF, which chains Service S1 (D1 → D2) and Service S2 (D2 → D3), so WF as a whole takes D1 and yields D3. Complex model: nested workflows WF1 and WF2 with Services S1, S2, S3 and data products D1 to D4]
[12/43] [2007-08-16] Provenance Framework in Support of Data Quality Estimation
Activities
Collecting Provenance
Activities generated during lifecycle of workflow
“Sensors” generate activities: Instrumentation of services, clients
Track execution across space, time, depth & operation
◦ Space: which service
◦ Time: when (logical time)
◦ Depth: distance from invocation root (client » workflow » service … nested workflows)
◦ Operation: Track dataflow
18 activities defined
Support Dynamic, Adaptive Workflow
Instrumentation of Services & WF
[Figure: instrumented components: WF Engine, Web Service, WS Client]
Karma Architecture
[Figure: a Workflow Engine runs a workflow instance of Services 1 to 10, with 10 data products consumed (10C) and produced (10P) by each service. The components publish provenance activities as notifications (Workflow-Started & -Finished, Application-Started & -Finished, Data-Produced & -Consumed) through the WS-Eventing service API to the WS-Messenger Notification Broker, which acts as the message bus. The Karma Provenance Service subscribes and listens to activity notifications through its Provenance Listener, stores them in the Activity DB, and exposes a Provenance Query API that the Provenance Browser Client uses to query for workflow, process & data provenance]
A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al.; ICWS, 2006
Service Invocation State Diagram
[Figure: state diagram spanning CLIENT and SERVICE, with states Service Invoked, Data Transfer In, Computation, Data Consumed, Data Produced, Data Transfer Out, Sending Result]
Activities
Types & Source
Activity                                 Generated By   Type
[Service | Workflow] Initialized         Service        Independent
[Service | Workflow] Terminated          Service        Independent
Invoking Service                         Client         Bounding
Service Invoked                          Service        Bounding
Invoking Service [Succeeded | Failed]    Client         Bounding
Data Transfer                            Service        Operational
Computation                              Service        Operational
Data Produced                            Service        Operational
Data Consumed                            Service        Operational
Sending [Result | Fault]                 Service        Bounding
Received [Result | Fault]                Client         Bounding
Sending Response [Succeeded | Failed]    Service        Bounding
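As a rough check on the activity table, the bracketed alternatives can be expanded programmatically. The sketch below assumes that each `[A | B]` stands for two distinct activities and that the Type column lines up row by row with the activity list; under that reading the expansion yields 18 names, which matches the "18 activities defined" figure stated earlier in the deck. The `expand` helper is hypothetical.

```python
import itertools
import re

# (activity pattern, generated by, type), read off the table above.
ROWS = [
    ("[Service | Workflow] Initialized", "Service", "Independent"),
    ("[Service | Workflow] Terminated", "Service", "Independent"),
    ("Invoking Service", "Client", "Bounding"),
    ("Service Invoked", "Service", "Bounding"),
    ("Invoking Service [Succeeded | Failed]", "Client", "Bounding"),
    ("Data Transfer", "Service", "Operational"),
    ("Computation", "Service", "Operational"),
    ("Data Produced", "Service", "Operational"),
    ("Data Consumed", "Service", "Operational"),
    ("Sending [Result | Fault]", "Service", "Bounding"),
    ("Received [Result | Fault]", "Client", "Bounding"),
    ("Sending Response [Succeeded | Failed]", "Service", "Bounding"),
]


def expand(pattern):
    """Expand '[A | B] rest' into ['A rest', 'B rest']."""
    # re.split with a capturing group puts bracket contents at odd indices.
    parts = re.split(r"\[([^\]]*)\]", pattern)
    choices = [
        [alt.strip() for alt in part.split("|")] if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo).strip() for combo in itertools.product(*choices)]


ACTIVITIES = [
    (name, source, kind)
    for pattern, source, kind in ROWS
    for name in expand(pattern)
]
client_side = [name for name, source, _ in ACTIVITIES if source == "Client"]
```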
Activities
Sequence Diagram for Basic Workflow
[Figure: sequence diagram between Client (C) and Service (S), with input data D1 and output data D2; axes: Time, Space, Depth, Operation (Transfer, Consume, Compute, Produce). Activities in time order:]
1. S: Initialize
2. C: Invoke Service
3. S: Invoked
4. C: Invocation Successful
5. S: Transfer Input Data D1
6. S: Consume Data D1
7. S: Perform Computation
8. S: Produce Data D2
9. S: Transfer Output Data D2
10. S: Send Response
11. C: Receive Response
12. S: Send Response Successful
13. S: Terminate
Sequence Diagram for Simple Workflow
[Figure: sequence diagram for workflow WF (run by the Workflow Engine) invoking Service S1 (D1 → D2) and then Service S2 (D2 → D3); axes: Time, Space, Depth, Operation (Consume, Produce). Activities in time order:]
1. S1, S2, WF: Initialize
2. WF: Invoke Service S1
3. S1: Invoked
4. WF: Invocation Successful
5. S1: Consume Data D1
6. S1: Produce Data D2
7. S1: Send Response
8. WF: Receive Response
9. S1: Send Response Successful
10. WF: Invoke Service S2
11. S2: Invoked
12. WF: Invocation Successful
13. S2: Consume Data D2
14. S2: Produce Data D3
15. S2: Send Response
16. WF: Receive Response
17. S2: Send Response Successful
18. S1, S2, WF: Terminate
Activities
Naming
Uniquely identifying data & services is critical for provenance
Data product has GUID. Replicas have URLs.
Service & Workflow instances have GUID
Services defined in the context of workflows have a Node ID in the workflow name space
Clients have GUID
Entity: 4-tuple
◦ <Workflow ID, Service ID, Node ID, Timestep>
Invocation: 2-tuple
◦ <Invoker Entity, Invokee Entity>
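The entity and invocation tuples map naturally onto immutable records. The sketch below uses Python `NamedTuple`s; the type and field names are assumptions chosen to mirror the slide, and the identifier values are taken from the sample activity XML in this deck. How a top-level workflow fills its own Workflow ID, Node ID and Timestep is not stated here, so the `None` values for the invoker are purely illustrative.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    """<Workflow ID, Service ID, Node ID, Timestep>"""
    workflow_id: Optional[str]
    service_id: str
    node_id: Optional[str]
    timestep: Optional[int]


class Invocation(NamedTuple):
    """<Invoker Entity, Invokee Entity>"""
    invoker: Entity
    invokee: Entity


WORKFLOW = "tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
CONVERT = "urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService"

# Assumption: the invoking workflow has no enclosing workflow context.
invoker = Entity(None, WORKFLOW, None, None)
invokee = Entity(WORKFLOW, CONVERT, "ConvertService_4", 36)
call = Invocation(invoker, invokee)

# Tuples are hashable, so an invocation can key a lookup table directly.
seen = {call: "serviceInvoked"}
```

Because the tuples compare by value, two activities that report the same four identifiers resolve to the same entity without any extra bookkeeping.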
Activities
Provenance Activity Contents
Activity Type
Source Entity: 4-tuple
◦ <Workflow ID, Service ID, Node ID, Timestep>
Remote Entity: 4-tuple
Attributes
◦ todo
Annotations
Activities
Modeling Activities in XML
<serviceInvoked xmlns="http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking">
  <notificationSource workflowNodeID="ConvertService_4" workflowTimestep="36"
      workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
      serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
  <timestamp>2006-09-10T23:56:28.677Z</timestamp>
  <description>Convert Service was Invoked</description>
  <request><header>...</header><body>...</body></request>
  <initiator serviceID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1" />
</serviceInvoked>
<dataProduced xmlns="http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking">
  <notificationSource workflowNodeID="ConvertService_4" workflowTimestep="36"
      workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
      serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
  <timestamp>2006-09-10T23:56:32.324Z</timestamp>
  <dataProduct>
    <id>lead:uuid:1157946992-atlas-x.gif</id>
    <location>gsiftp://tyr1.cs.indiana.edu/tmp/20060910235628_Convert/outputData/atlas-x.gif</location>
    <timestamp>2006-09-10T23:56:32.324Z</timestamp>
  </dataProduct>
</dataProduced>
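To make the structure of such an activity concrete, here is a minimal sketch that parses the `dataProduced` sample with Python's standard `xml.etree.ElementTree`. The `parse_activity` function and the shape of the dictionary it returns are assumptions for illustration; they are not Karma's own parsing code.

```python
import xml.etree.ElementTree as ET

NS = "http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking"

DATA_PRODUCED = f"""<dataProduced xmlns="{NS}">
  <notificationSource workflowNodeID="ConvertService_4" workflowTimestep="36"
      workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
      serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
  <timestamp>2006-09-10T23:56:32.324Z</timestamp>
  <dataProduct>
    <id>lead:uuid:1157946992-atlas-x.gif</id>
    <location>gsiftp://tyr1.cs.indiana.edu/tmp/20060910235628_Convert/outputData/atlas-x.gif</location>
    <timestamp>2006-09-10T23:56:32.324Z</timestamp>
  </dataProduct>
</dataProduced>"""


def parse_activity(xml_text):
    """Extract the activity type, source entity 4-tuple and data IDs."""
    root = ET.fromstring(xml_text)
    ns = {"wt": NS}
    src = root.find("wt:notificationSource", ns)
    return {
        # The tag looks like '{namespace}dataProduced'; keep the local name.
        "type": root.tag.split("}", 1)[1],
        # <Workflow ID, Service ID, Node ID, Timestep>
        "source": (
            src.get("workflowID"),
            src.get("serviceID"),
            src.get("workflowNodeID"),
            int(src.get("workflowTimestep")),
        ),
        # Direct child only, not the timestamp nested in dataProduct.
        "timestamp": root.findtext("wt:timestamp", namespaces=ns),
        "data_ids": [e.text for e in root.findall("wt:dataProduct/wt:id", ns)],
    }


activity = parse_activity(DATA_PRODUCED)
```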
Activities
Publishing Activities as Notifications
Activities are modeled as notifications that are sent by different components
◦ Loosely coupled, easy to generate provenance
XML Representation of provenance activities
WS-Messenger Notification Broker acts as message bus
◦ WS-Eventing & WS-Notification
Provenance service & interested clients subscribe to notifications
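The loose coupling described on this slide follows the usual publish/subscribe pattern. The sketch below is an in-process stand-in only: it does not use WS-Messenger, WS-Eventing or WS-Notification, and `NotificationBroker`, the `"provenance"` topic name and the dictionary messages are all hypothetical.

```python
from collections import defaultdict


class NotificationBroker:
    """Toy message bus: publishers and subscribers only share a topic name."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver to every current subscriber; publishers never see them.
        for callback in self._subscribers.get(topic, ()):
            callback(message)


broker = NotificationBroker()

activity_db = []   # stand-in for the provenance service's activity store
monitor_log = []   # stand-in for another interested client, e.g. a monitor

broker.subscribe("provenance", activity_db.append)
broker.subscribe("provenance", lambda m: monitor_log.append(m["type"]))

# An instrumented service publishes activities without knowing who listens.
broker.publish("provenance", {"type": "serviceInvoked", "node": "ConvertService_4"})
broker.publish("provenance", {"type": "dataProduced", "node": "ConvertService_4"})
```

The point of the design is visible even at this scale: adding a second listener required no change to the publishing side.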
Backend
Provenance Database
~Union of provenance model
Provenance incrementally built
Relational database (MySQL)
Information Model
Data Provenance View
Data Provenance
Entity is the state of a service or a client
Invocation relates a client (invoker) to a service (invokee). Status.
Data provenance of produced data relates invocation with consumed data
Lightweight schema
Karma2: Provenance Management for Data Driven Workflows, Simmhan, Y., et al.; J. Web Svc. Res., 2008
[Figure: Client ENTITY (Invoker) sends a Request to Service ENTITY (Invokee), which returns a Response]
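One way to picture the entity, invocation and data relationships in a relational store is the sketch below. It is an assumption-laden illustration: the deck names MySQL but does not give the schema, so the table and column names (`entity`, `invocation`, `data_io`, `role`) and the sample rows are hypothetical, and SQLite is used here only so the example runs in memory without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entity (
    id INTEGER PRIMARY KEY,
    workflow_id TEXT, service_id TEXT, node_id TEXT, timestep INTEGER
);
CREATE TABLE invocation (
    id INTEGER PRIMARY KEY,
    invoker INTEGER REFERENCES entity(id),
    invokee INTEGER REFERENCES entity(id),
    status TEXT
);
CREATE TABLE data_io (
    invocation_id INTEGER REFERENCES invocation(id),
    data_id TEXT,
    role TEXT CHECK (role IN ('consumed', 'produced'))
);
""")

# Hypothetical rows for workflow WF invoking S1 (D1 -> D2), then S2 (D2 -> D3).
conn.executemany("INSERT INTO entity VALUES (?, ?, ?, ?, ?)", [
    (1, "WF", "WF", None, 0),
    (2, "WF", "S1", "S1_1", 1),
    (3, "WF", "S2", "S2_2", 2),
])
conn.executemany("INSERT INTO invocation VALUES (?, ?, ?, ?)", [
    (1, 1, 2, "SUCCESS"),
    (2, 1, 3, "SUCCESS"),
])
conn.executemany("INSERT INTO data_io VALUES (?, ?, ?)", [
    (1, "D1", "consumed"), (1, "D2", "produced"),
    (2, "D2", "consumed"), (2, "D3", "produced"),
])


def consumed_by_producer(data_id):
    """Data provenance: inputs of the invocation that produced data_id."""
    rows = conn.execute("""
        SELECT c.data_id
        FROM data_io AS p
        JOIN data_io AS c
          ON c.invocation_id = p.invocation_id AND c.role = 'consumed'
        WHERE p.role = 'produced' AND p.data_id = ?
        ORDER BY c.data_id
    """, (data_id,)).fetchall()
    return [row[0] for row in rows]
```

Rows can be appended as activities arrive, which fits the "incrementally built" property noted for the provenance database.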
Information Model
Data Provenance & Usage Views
[Figure: Client ENTITY (Invoker) and Service ENTITY (Invokee) linked by Request and Response]
Information Model
Workflow & Process Provenance Views
[Figure: Client ENTITY (Invoker) and Service ENTITY (Invokee) linked by Request and Response]
Dissemination
Querying Provenance
All 5 provenance models can be queried by ID
◦ Data Provenance (by Data ID)
◦ Recursive Data Provenance (by Data ID, depth)
◦ Data Usage (by Data ID)
◦ Process Provenance (by Invoker & Invokee)
◦ Workflow Trace (by Invoker & Invokee, depth)
Service API to query and return results as XML Document
Provenance Challenge Workshop
◦ Direct API, Incremental client, Graph matching algorithm
Incremental building of complex queries
Query Capabilities of the Karma Provenance Framework, Simmhan, Y., et al.; 1st Provenance Challenge & CCPE J., 2007
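The difference between data provenance, recursive data provenance with a depth bound, and data usage can be sketched over a small in-memory record list. This is not the Karma query API: the record layout, the function names and the nested-dictionary result are assumptions, and the real service returns XML documents.

```python
# One record per invocation: who invoked whom, and the data involved.
RECORDS = [
    {"invoker": "WF", "invokee": "S1", "consumed": ["D1"], "produced": ["D2"]},
    {"invoker": "WF", "invokee": "S2", "consumed": ["D2"], "produced": ["D3"]},
]


def data_provenance(data_id):
    """Return the invocation record that produced data_id, if any."""
    for record in RECORDS:
        if data_id in record["produced"]:
            return record
    return None


def recursive_data_provenance(data_id, depth):
    """Follow consumed data backwards for at most `depth` derivation steps."""
    step = data_provenance(data_id)
    if step is None or depth <= 0:
        return {"data": data_id, "derived_from": []}
    return {
        "data": data_id,
        "produced_by": step["invokee"],
        "derived_from": [
            recursive_data_provenance(d, depth - 1) for d in step["consumed"]
        ],
    }


def data_usage(data_id):
    """Forward view: which services consumed data_id."""
    return [r["invokee"] for r in RECORDS if data_id in r["consumed"]]


one_step = recursive_data_provenance("D3", 1)
full = recursive_data_provenance("D3", 2)
```

With depth 1 the result stops at D2; with depth 2 it continues back to D1, which no recorded invocation produced.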
Applications: Process Monitoring
Realtime Monitoring using XBaya
Applications: Information Integration
Visual Exploration using Karma GUI
Performance & Scalability Study
Experimental Setup
[Figure: provenance clients on odin001 … odin128 generate and query provenance against the provenance services on tyr10 to tyr13 (Karma, WS-Messenger Broker, MySQL; PReServ in Tomcat 5.0 with embedded Java DB) over a Gbps network]
Provenance Service Components
◦ Karma Service, WS-Messenger Notification Broker, MySQL
◦ PReServ in Tomcat 5.0 container
Tyr web-services cluster (16 Nodes)
◦ Dual-Processor 2.0 GHz 64-bit Opteron, 16GB RAM, Local IDE disk
Odin computer cluster (128 Nodes)
◦ Dual-Processor 2.0 GHz 64-bit Opteron, 4GB RAM
Gigabit Ethernet, local IDE disk storage
SLURM job manager for parallel job submission on Odin
Java 1.5, Jython
Performance & Scalability Study
Collecting Provenance
Comparative Study of Karma with PReServ (U. Soton)
Provenance services on tyr (2 GHz/16GB/64bit) & clients on odin (2 GHz/4GB/64bit)
Time to collect provenance activities synchronously
1. Single service with increasing number of service invocations
◦ Karma scales linearly
2. Linear workflow with increasing number of data produced/consumed
◦ Karma scales linearly, PReServ constant
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006
Performance & Scalability Study
Collecting Provenance
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006
Time to collect provenance from simulated ensemble WRF forecasting workflow
Scalability with increasing # of parallel runs
1–20 concurrent workflows
Karma scales sub-linearly
Performance & Scalability Study
Querying Provenance
Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006
Response time to query workflow, process, and data provenance from Karma (PReServ was an order of magnitude slower)
Scalability with increasing # of concurrent clients
Karma contains 1000 workflow invocations
Query for 20 workflow/200 process/200 data provenance documents
Related Work
PReServ, U. of Southampton (Luc Moreau)
◦ Standalone, Annotation support
◦ No data provenance, workflow concept; poor performance
VisTrails, U. of Utah (Juliana Freire)
◦ Workflows for graphical modeling
◦ Constrained to browser
PASS, Harvard U. (Margo Seltzer)
◦ System level provenance
◦ No service/data abstraction
Trio, Stanford U. (Jennifer Widom)
◦ Tuple level provenance on Database operations
◦ Restricted to databases
Data Collector, IBM (alphaworks)
◦ Automatically record & track SOAP Messages
◦ No data provenance
What is new in Karma 3?
Process control flow tracking
Vertical integration across applications
◦ Support for database queries
Process & data abstraction
Mining provenance logs
◦ WF composition
◦ Semantic support (S-OGSA)