provenance: an open approach to experiment validation in e- science professor luc moreau...
TRANSCRIPT
Provenance: an open approach to experiment validation in e-Science
Professor Luc [email protected] of Southampton
www.ecs.soton.ac.uk/~lavm
Provenance & PASOA Teams
University of Southampton Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco,
Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen IBM UK (EU Project Coordinator)
John Ibbotson, Neil Hardman, Alexis Biller University of Wales, Cardiff
Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari
Universitad Politecnica de Catalunya (UPC) Steven Willmott, Javier Vazquez
SZTAKI Laszlo Varga, Arpad Andics, Tamas Kifor
German Aerospace Andreas Schreiber, Guy Kloss, Frank Danneman
Contents
Motivation Provenance Concepts Provenance Architecture Standardisation Provenance Queries Conclusions
Audit & Business Regulations
Audit: - Sarbanes-Oxley - Basel II - European Rec. R(97)5 (protection of medical data) - ….
Accounting
Banking
Healthcare
Compliance to Regulations
The “next-compliance” problem Can we be certain
that by ensuring compliance to a new regulation, we do not break previous compliance?
Current Solutions
Proprietary, Monolithic
Silos, Closed Do not inter-operate
with other applications
Not adaptable to new regulations
Provenance
Oxford English Dictionary: the fact of coming from some particular
source or quarter; origin, derivation the history or pedigree of a work of art,
manuscript, rare book, etc.; concretely, a record of the passage of an item through its various owners.
Concept vs representation
Provenance in Computer Systems
Our definition of provenance in the context of applications for which process matters to end users:
The provenance of a piece of data is the process that led to that piece of data
Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases
Our Approach
Define core concepts pertaining to provenance
Specify functionality required to become “provenance-aware”
Define open data models and protocols that allow systems to inter-operate
Standardise data models and protocols Provide a reference implementation Provide reasoning capability
Context (1)
Aerospace engineering: maintain a historical record of design processes, up to 99 years.
Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients
Context (2)
High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)
Bioinformatics: verification and auditing of “experiments” (e.g.for drug approval)
Provenance “Lifecycle”
ApplicationApplication
Data Results
ProvenanceStore
Record Documentation of Execution
Query andReason overProvenance
of Data
AdministerStore and itscontents
Core Interfaces to Provenance Store
Nature of Documentation
We represent the provenance of some data by documenting the process that led to the data: documentation can be complete or
partial; it can be accurate or inaccurate; it can present conflicting or
consensual views of the actors involved;
it can provide operational details of execution or it can be abstract.
p-assertion
A given element of process documentation will be referred to as a p-assertion p-assertion: is an assertion that
is made by an actor and pertains to a process.
Service Oriented Architecture
Broad definition of service as component that takes some inputs and produces some outputs.
Services are brought together to solve a given problem typically via a workflow definition that specifies their composition.
Interactions with services take place with messages that are constructed according to services interface specification.
The term actor denotes either a client or a service in a SOA.
A process is defined as execution of a workflow
M1
M2
M3
M4
Actor 1 Actor 2
I received M1, M4I sent M2, M3
I received M3I sent M4
From these p-assertions, we can derive that M3 was sent by Actor 1and received by Actor 2 (and likewise for M4)
If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages
Process Documentation (1)
M1
M2
M3
M4
Actor 1 Actor 2
M2 is in reply to M1M3 is caused by M1M2 is caused by M4
M4 is in reply to M3
These assertions help identify order of messages,but not how data was computed
Process Documentation (2)
f
M1
M2
M3
M4
Actor 1 Actor 2
f1
f2
M3 = f1(M1)M2 = f2(M1,M4) M4 = f(M3)
These assertions help identify how data is computed,but provide no information about non-functional characteristics of the computation(time, resources used, etc)
Process Documentation (3)
M1
M2
M3
M4
Actor 1 Actor 2
I used 386 clusterRequest sat inqueue for 6min
I used sparc processor
I used algorithm x version x.y.z
Process Documentation (4)
Types of p-assertions (1)
Interaction p-assertion: is an assertion of the contents of a message by an actor that has sent or received that message
I received M1, M4I sent M2, M3
Types of p-assertions (2)
Relationship p-assertion: is an assertion, made by an actor, that describes how the actor obtained an output message sent in an interaction by applying some function to input messages from other interactions (likewise for data)
M2 is in reply to M1M3 is caused by M1M2 is caused by M4
M3 = f1(M1)M2 = f2(M1,M4)
Types of p-assertions (3)
Actor state p-assertion: assertion made by an actor about its internal state in the context of a specific interaction
I used sparc processor
I used algorithm xversion x.y.z
Data flow
Interaction p-assertions allow us to specify a flow of data between actors
Relationship p-assertions allow us to characterise the flow of data “inside” an actor
Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result
Interfaces to Provenance Store
ApplicationApplication
Results
ProvenanceStore
Record Documentation of Execution
Query andReason overProvenance
of Data
AdministerStore and itscontents
The p-structure (1)
The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors
Hierarchical Indexed by interactions (interaction= 1 message
exchange)
Sender’s viewReceiver’s view
The p-structure (2)
All p-assertionsasserted by a givenactor participatingin an interaction
Asserter identity
Recording Protocol (Groth04-06)
Abstract machines DS Properties
Termination Liveness Safety Statelessness
Documentation Properties Immutability Attribution Datatype safety
Foundation for adding necessary cryptographic techniques
Querying Functionality (Miles06)
Process Documentation Query Interface: allows for “navigation” of the documentation of execution Allows us to view the provenance store (i.e.
the p-structure) as if containing XML data structures
Independent of technology used for running application and internal store representation
Seamless navigation of application dependent and application independent process documentation
Querying Functionality (Miles06)
Provenance Query Interface: allows us to obtain the provenance of some specific data
A recognition that there is not “one” provenance for a piece of data, but there may be different, depending on the end-user’s interest
Hence, provenance is seen as the result of a query: Identify a piece of data at a specific execution point Scope of the process of interest:
Filter in/out p-assertions according to actors, process, types of relationships, etc
Available Software
PReServ (Paul Groth & Simon Miles)
Offer recording and querying interfaces
Available from www.pasoa.org
OGSA-DAI based version available from www.gridprovenance.org
Is being used in a bioinformatics application (cf. hpdc’05, iswc’05)
AxisHandler
AxisHandler
Provenance Store
Backend Interface
XML Rel DB
…Backend Stores
PS Client Side
Library
PS Client Side
Library
Web ServiceWS Client
Query Actor WS
PS Client Side
Library
WS Calls
Java Calls
AxisHandler
AxisHandler
AxisHandler
AxisHandler
Provenance Store
Backend Interface
XML Rel DB
…Backend Stores
Backend Interface
XML Rel DB
…Backend Stores
PS Client Side
Library
PS Client Side
Library
PS Client Side
Library
PS Client Side
Library
Web ServiceWS Client Web ServiceWS Client
Query Actor WS
PS Client Side
Library
Query Actor WS
PS Client Side
Library
WS Calls
Java Calls
Provenance Store Components
PStoreDatabaseOGSA-DAI
ClientAPI
ProvenanceStoreResource
PStoreDatabaseOGSA-DAI
ClientAPI
ProvenanceStoreResource
eXistXML
Database
OGSA-DAI
Globus GT4 Container
Globus GT4Container
External Security Services
ProvenanceStoreFactoryFactory
PStoreDatabaseOGSA-DAI
ClientAPI
ProvenanceServiceResource
ProvenanceServiceResourceHome
Uses
UsesManages
ProvenanceService
Destroy
Record
PQuery
XPath
XQuery
Iterate
Resources
Actor CSL
Slide from John Ibbotson
ProvenanceStoreFactory
Provenance Store Security
Provenance GT4 Container
PolicyDecision
Point
PolicyDecision
Point
ACLFile
(XML)
Approve
Approve
Request
Deny
Deny
ProvenanceService
Destroy
Record
PQuery
XPath
XQuery
Iterate
Resources
Factory
Actor CSL
Slide from John Ibbotson
Provenance Implementation
The Client Side Library exposes Provenance Store functionality and separates Actor from alternative Server side implementations EU Provenance project implementation PASOA PreServ
Security is being extended to allow federation using Globus Community Authorization Service (CAS)
Slide from John Ibbotson
Standardisation Options
APIsProgrammatic
inter-op
Recording and querying
InterfacesService inter-op
Provenance ModelData inter-op
Purpose of Standardisation
ApplicationApplication
ProvenanceStores
Record Documentation of Execution
ApplicationApplication
Allow for multiple applications to document their execution.Applications may be running in different institutions.
Purpose of Standardisation
ApplicationApplication
ProvenanceStore
Record Documentation of Execution
Allow for multiple stores from multiple IT providers
ProvenanceStore
ProvenanceStore
Purpose of Standardisation
ProvenanceStore Query
Provenanceof Data
Allow for multiple stores from multiple IT providers
ProvenanceStore
Purpose of Standardisation
Allow for legacy, monolithic applications to expose theircontents (according to standard schema)
Convert in standard dataformat
Purpose of Standardisation
Allow third parties to host provenance stores, which are trusted by application owners but also auditors
ApplicationApplication
ProvenanceStore
Compliance Oriented Architectures
Separate execution documentation from compliance verification
Allows for multiple compliance verifications
Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting).
Approach is suitable for e-scientific peer-reviewing and business compliance verification
ApplicationApplication
ProvenanceStore
Record Documentationof Execution
QueryProvenance
Of Data
Complianceverification
ApplicationApplication
ProvenanceStore
Record Documentationof Execution
Record Documentationof Execution
QueryProvenance
Of Data
ComplianceverificationComplianceverification
Standardisation Philosophy
Thin layer common between systems: extensible data model
Model can be extended for specific: technologies (WS, Web, …), or application domains (Bio, Healthcare,
Desktop, …) Service interfaces
WS-Prov-Intro
WS-Prov-DM
WS-Prov-Glo
WS-Prov-Rec WS-Prov-Query
WS-Prov-DM-Link
WS-Prov-DM-Infer
WS-Prov-DM-DS
Generic Profiles Domain Specific Profiles
WS-Prov-SOAP
Technology Bindings
WS-Prov-DM-Sec
WS-Prov-WWW
WS-Prov-DM-Rel
WS-Prov-Primer
Proposed List of Specifications
Example Application
GUI Averager Divider
Store
1. average (7, 5)
2. divide (12, 2)
3. answer (6) 4. answer (6)
5. store (“6”, file1) Averager(in1,in2) {
return (in1+in2)/2;
}
Averager delegates the division operation to the service Divider
Example Application
GUI Averager Divider
Store
1. average (7, 5)
2. divide (12, 2)
3. answer (6) 4. answer (6)
5. store (“6”, file1)
Relationships• 12 in msg 2 is sum of 7, 5 in msg 1• 6 in msg 3 is division of 12, 2 in msg 2• 6 in msg 4 is copy of 6 in msg 3• 6 in msg 4 is average of 7, 5 in msg 1• 6 in msg 6 is copy of 6 in msg 4
Tracers• are used to demarcate activities (aka sets of services)• added by Averager in call to Divider• returned by Divider in response
The data we want to find the provenance of
Identify the event where the entity is documented:
In this case, the event is the receipt of a request to store the data in file named file1
Identify the data entity within that message In this case, the data of interest is the “6”
stored in file1
“file1” Store
Provenance Graph
“6” StoreGUI
“6” GUIAverager
“6” AveragerDivider
“12” DividerAverager “2” DividerAverager
“5” AveragerGUI“7” AveragerGUI
Copy of
Copy of
Division of
DividendDivisor
Sum of
Average of
Scoped Provenance Graph
“6” StoreGUI
“6” GUIAverager
“6” AveragerDivider
“12” DividerAverager “2” DividerAverager
“5” AveragerGUI“7” AveragerGUI
Copy of
Copy of
Division of
DividendDivisor
Sum of
Average of
Filter to exclude“Average of”relationships
Allows us to ignore the high level structure of the computation and to focus on the actual operations
e.g. allows us to establish what a given provider actually does
Scoped Provenance Graph
“6” StoreGUI
“6” GUIAverager
“6” AveragerDivider
“12” DividerAverager “2” DividerAverager
“5” AveragerGUI“7” AveragerGUI
Copy of
Copy of
Division of
DividendDivisor
Sum of
Average of
Filter to excludemessages containingtracer
This is equivalent tohiding the internaloperation of Averager
Allows us to consider a given service (and all its inferior invocations) as a black box: high level account of provenance
e.g. no detail should be provided about the internals of Averager
Scoped Provenance Graph
“6” StoreGUI
“6” GUIAverager
“6” AveragerDivider
“12” DividerAverager “2” DividerAverager
“5” AveragerGUI“7” AveragerGUI
Copy of
Copy of
Division of
DividendDivisor
Sum of
Average of
Filter to excludeDivisor parameters
Allows us to scope the provenance graph according to types of data or operations
e.g. looking at the restorations of a painting rather than its various owners
Practically …
Event and Data Identification//ps:interactionRecord
[ps:interactionKey/ps:messageSink/wsa:EndpointReference/
wsa:Address="http://www.example.com/store"]
The interaction record in which the receiver (messageSink) hasaddress http://www.example.com/store
//ps:interactionPAssertion [ex:envelope/ex:store/ex:location="/home/sm/data/file1"]
//ex:envelope/ex:store/ex:data
Event identification
Data identification
Practically …
The scope of the provenance query
Unscoped query/
Exclude ‘averageOf’ relation/pq:relationshipTarget[ps:relation!="http://www.example.com#averageOf"]
Exclude tracer introduced by Averager/pq:relationshipTarget/ps:interactionPAssertion[not(ex:envelope/ph:pheader/ph:interactionMetaData[ph:tracer="process://sub/1"])]
Provenance of Donor Diagnosis Request
DiagnoseRequest Decision Maker
Donor DataCollector
DiagnoseRequest Donor Data
CollectorBrain Death
Manager
Brain DeathManager
Testing Lab
DataCollectionComplete Brain Death
ManagerHealthcare
Record Manager
Patient(in Brain DeathNotification) Brain Death
ManagerUser Interface
Was Caused By
Is Diagnosis Request For
Patient
TestResults
Test ResultsDonor Data Collection
Data CollectionRequest Healthcare
Record ManagerDonor Data
Collector
EHCR HealthcareRecord Manager
EHCRS
EHCRRequest
EHCRSHealthcare
Record Manager
Includes Data
Is Response To
Was Caused By
Is Response To
ProvenanceStore
Reco
rd
To Sum Up
Query
Compliance check Rerun/Reproduce Analyse
Standardising thedocumentation of
Business Processes
Provenance Architecture Methodology
Apply
Healthcare
DistributionFinance
Aerospace
Automobile
Pharmaceutical
Slide from John Ibbotson
Conclusions
Crucial topic for many applications Full architectural specification An implementation available for
download Methodology to make application
provenance-aware www.pasoa.org www.gridprovenance.org
Publications1. Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter
Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July 2005.
2. Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.
3. Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September 2004.
4. Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton, 2005.
5. Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Lazslo Varga, Ulises Cortes, and Steven Willmott. Provenance-based Trust for Grid Computing --- Position Paper. In , 2003.
6. Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proc. of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May 2005.
7. Paul Groth, Simon Miles, Victor Tan, and Luc Moreau. Architecture for Provenance Systems. Technical report, University of Southampton, October 2005.