provenance: an open approach to experiment validation in e- science professor luc moreau...

68
Provenance: an open approach to experiment validation in e-Science Professor Luc Moreau [email protected] University of Southampton www.ecs.soton.ac.uk/~lavm

Upload: merilyn-morgan

Post on 02-Jan-2016

216 views

Category:

Documents


0 download

TRANSCRIPT

Provenance: an open approach to experiment validation in e-Science

Professor Luc [email protected] of Southampton

www.ecs.soton.ac.uk/~lavm

Provenance & PASOA Teams

University of Southampton Luc Moreau, Paul Groth, Simon Miles, Victor Tan, Miguel Branco,

Sofia Tsasakou, Sheng Jiang, Steve Munroe, Zheng Chen IBM UK (EU Project Coordinator)

John Ibbotson, Neil Hardman, Alexis Biller University of Wales, Cardiff

Omer Rana, Arnaud Contes, Vikas Deora, Ian Wootten, Shrija Rajbhandari

Universitad Politecnica de Catalunya (UPC) Steven Willmott, Javier Vazquez

SZTAKI Laszlo Varga, Arpad Andics, Tamas Kifor

German Aerospace Andreas Schreiber, Guy Kloss, Frank Danneman

Contents

Motivation Provenance Concepts Provenance Architecture Standardisation Provenance Queries Conclusions

Motivation

Scientific Research

Academic Peer Review

Audit & Business Regulations

Audit: - Sarbanes-Oxley - Basel II - European Rec. R(97)5 (protection of medical data) - ….

Accounting

Banking

Healthcare

e-Science datasets

How to undertake peer-reviewing and validation of e-Scientific results?

Compliance to Regulations

The “next-compliance” problem Can we be certain

that by ensuring compliance to a new regulation, we do not break previous compliance?

Current Solutions

Proprietary, Monolithic

Silos, Closed Do not inter-operate

with other applications

Not adaptable to new regulations

Provenance

Oxford English Dictionary: the fact of coming from some particular

source or quarter; origin, derivation the history or pedigree of a work of art,

manuscript, rare book, etc.; concretely, a record of the passage of an item through its various owners.

Concept vs representation

Provenance in Computer Systems

Our definition of provenance in the context of applications for which process matters to end users:

The provenance of a piece of data is the process that led to that piece of data

Our aim is to conceive a computer-based representation of provenance that allows us to perform useful analysis and reasoning to support our use cases

Our Approach

Define core concepts pertaining to provenance

Specify functionality required to become “provenance-aware”

Define open data models and protocols that allow systems to inter-operate

Standardise data models and protocols Provide a reference implementation Provide reasoning capability

Context (1)

Aerospace engineering: maintain a historical record of design processes, up to 99 years.

Organ transplant management: tracking of previous decisions, crucial to maximise the efficiency in matching and recovery rate of patients

Context (2)

High Energy Physics: tracking, analysing, verifying data sets in the ATLAS Experiment of the Large Hadron Collider (CERN)

Bioinformatics: verification and auditing of “experiments” (e.g.for drug approval)

Provenance Concepts

Provenance “Lifecycle”

ApplicationApplication

Data Results

ProvenanceStore

Record Documentation of Execution

Query andReason overProvenance

of Data

AdministerStore and itscontents

Core Interfaces to Provenance Store

Nature of Documentation

We represent the provenance of some data by documenting the process that led to the data: documentation can be complete or

partial; it can be accurate or inaccurate; it can present conflicting or

consensual views of the actors involved;

it can provide operational details of execution or it can be abstract.

p-assertion

A given element of process documentation will be referred to as a p-assertion p-assertion: is an assertion that

is made by an actor and pertains to a process.

Service Oriented Architecture

Broad definition of service as component that takes some inputs and produces some outputs.

Services are brought together to solve a given problem typically via a workflow definition that specifies their composition.

Interactions with services take place with messages that are constructed according to services interface specification.

The term actor denotes either a client or a service in a SOA.

A process is defined as execution of a workflow

M1

M2

M3

M4

Actor 1 Actor 2

I received M1, M4I sent M2, M3

I received M3I sent M4

From these p-assertions, we can derive that M3 was sent by Actor 1and received by Actor 2 (and likewise for M4)

If actors are black boxes, these assertions are not very useful because we do not know dependencies between messages

Process Documentation (1)

M1

M2

M3

M4

Actor 1 Actor 2

M2 is in reply to M1M3 is caused by M1M2 is caused by M4

M4 is in reply to M3

These assertions help identify order of messages,but not how data was computed

Process Documentation (2)

f

M1

M2

M3

M4

Actor 1 Actor 2

f1

f2

M3 = f1(M1)M2 = f2(M1,M4) M4 = f(M3)

These assertions help identify how data is computed,but provide no information about non-functional characteristics of the computation(time, resources used, etc)

Process Documentation (3)

M1

M2

M3

M4

Actor 1 Actor 2

I used 386 clusterRequest sat inqueue for 6min

I used sparc processor

I used algorithm x version x.y.z

Process Documentation (4)

Types of p-assertions (1)

Interaction p-assertion: is an assertion of the contents of a message by an actor that has sent or received that message

I received M1, M4I sent M2, M3

Types of p-assertions (2)

Relationship p-assertion: is an assertion, made by an actor, that describes how the actor obtained an output message sent in an interaction by applying some function to input messages from other interactions (likewise for data)

M2 is in reply to M1M3 is caused by M1M2 is caused by M4

M3 = f1(M1)M2 = f2(M1,M4)

Types of p-assertions (3)

Actor state p-assertion: assertion made by an actor about its internal state in the context of a specific interaction

I used sparc processor

I used algorithm xversion x.y.z

Data flow

Interaction p-assertions allow us to specify a flow of data between actors

Relationship p-assertions allow us to characterise the flow of data “inside” an actor

Overall data flow (internal + external) constitutes a DAG, which characterises the process that led to a result

Provenance Architecture

Interfaces to Provenance Store

ApplicationApplication

Results

ProvenanceStore

Record Documentation of Execution

Query andReason overProvenance

of Data

AdministerStore and itscontents

P-Assertion schemas

The p-structure (1)

The p-structure is a common logical structure of the provenance store shared by all asserting and querying actors

Hierarchical Indexed by interactions (interaction= 1 message

exchange)

Sender’s viewReceiver’s view

The p-structure (2)

All p-assertionsasserted by a givenactor participatingin an interaction

Asserter identity

Recording Protocol (Groth04-06)

Abstract machines DS Properties

Termination Liveness Safety Statelessness

Documentation Properties Immutability Attribution Datatype safety

Foundation for adding necessary cryptographic techniques

Querying Functionality (Miles06)

Process Documentation Query Interface: allows for “navigation” of the documentation of execution Allows us to view the provenance store (i.e.

the p-structure) as if containing XML data structures

Independent of technology used for running application and internal store representation

Seamless navigation of application dependent and application independent process documentation

Querying Functionality (Miles06)

Provenance Query Interface: allows us to obtain the provenance of some specific data

A recognition that there is not “one” provenance for a piece of data, but there may be different, depending on the end-user’s interest

Hence, provenance is seen as the result of a query: Identify a piece of data at a specific execution point Scope of the process of interest:

Filter in/out p-assertions according to actors, process, types of relationships, etc

Available Software

PReServ (Paul Groth & Simon Miles)

Offer recording and querying interfaces

Available from www.pasoa.org

OGSA-DAI based version available from www.gridprovenance.org

Is being used in a bioinformatics application (cf. hpdc’05, iswc’05)

AxisHandler

AxisHandler

Provenance Store

Backend Interface

XML Rel DB

…Backend Stores

PS Client Side

Library

PS Client Side

Library

Web ServiceWS Client

Query Actor WS

PS Client Side

Library

WS Calls

Java Calls

AxisHandler

AxisHandler

AxisHandler

AxisHandler

Provenance Store

Backend Interface

XML Rel DB

…Backend Stores

Backend Interface

XML Rel DB

…Backend Stores

PS Client Side

Library

PS Client Side

Library

PS Client Side

Library

PS Client Side

Library

Web ServiceWS Client Web ServiceWS Client

Query Actor WS

PS Client Side

Library

Query Actor WS

PS Client Side

Library

WS Calls

Java Calls

Provenance Store Components

PStoreDatabaseOGSA-DAI

ClientAPI

ProvenanceStoreResource

PStoreDatabaseOGSA-DAI

ClientAPI

ProvenanceStoreResource

eXistXML

Database

OGSA-DAI

Globus GT4 Container

Globus GT4Container

External Security Services

ProvenanceStoreFactoryFactory

PStoreDatabaseOGSA-DAI

ClientAPI

ProvenanceServiceResource

ProvenanceServiceResourceHome

Uses

UsesManages

ProvenanceService

Destroy

Record

PQuery

XPath

XQuery

Iterate

Resources

Actor CSL

Slide from John Ibbotson

ProvenanceStoreFactory

Provenance Store Security

Provenance GT4 Container

PolicyDecision

Point

PolicyDecision

Point

ACLFile

(XML)

Approve

Approve

Request

Deny

Deny

ProvenanceService

Destroy

Record

PQuery

XPath

XQuery

Iterate

Resources

Factory

Actor CSL

Slide from John Ibbotson

Provenance Implementation

The Client Side Library exposes Provenance Store functionality and separates Actor from alternative Server side implementations EU Provenance project implementation PASOA PreServ

Security is being extended to allow federation using Globus Community Authorization Service (CAS)

Slide from John Ibbotson

Standardisation

Standardisation Options

APIsProgrammatic

inter-op

Recording and querying

InterfacesService inter-op

Provenance ModelData inter-op

Purpose of Standardisation

ApplicationApplication

ProvenanceStores

Record Documentation of Execution

ApplicationApplication

Allow for multiple applications to document their execution.Applications may be running in different institutions.

Purpose of Standardisation

ApplicationApplication

ProvenanceStore

Record Documentation of Execution

Allow for multiple stores from multiple IT providers

ProvenanceStore

ProvenanceStore

Purpose of Standardisation

ProvenanceStore Query

Provenanceof Data

Allow for multiple stores from multiple IT providers

ProvenanceStore

Purpose of Standardisation

Allow for legacy, monolithic applications to expose theircontents (according to standard schema)

Convert in standard dataformat

Purpose of Standardisation

Allow third parties to host provenance stores, which are trusted by application owners but also auditors

ApplicationApplication

ProvenanceStore

Compliance Oriented Architectures

Separate execution documentation from compliance verification

Allows for multiple compliance verifications

Allows for validation to take place across multiple applications, possibly run by different institutions (in particular, allows for outsourcing and subcontracting).

Approach is suitable for e-scientific peer-reviewing and business compliance verification

ApplicationApplication

ProvenanceStore

Record Documentationof Execution

QueryProvenance

Of Data

Complianceverification

ApplicationApplication

ProvenanceStore

Record Documentationof Execution

Record Documentationof Execution

QueryProvenance

Of Data

ComplianceverificationComplianceverification

Standardisation Philosophy

Thin layer common between systems: extensible data model

Model can be extended for specific: technologies (WS, Web, …), or application domains (Bio, Healthcare,

Desktop, …) Service interfaces

WS-Prov-Intro

WS-Prov-DM

WS-Prov-Glo

WS-Prov-Rec WS-Prov-Query

WS-Prov-DM-Link

WS-Prov-DM-Infer

WS-Prov-DM-DS

Generic Profiles Domain Specific Profiles

WS-Prov-SOAP

Technology Bindings

WS-Prov-DM-Sec

WS-Prov-WWW

WS-Prov-DM-Rel

WS-Prov-Primer

Proposed List of Specifications

Provenance Queries(Miles’06)

Example Application

GUI Averager Divider

Store

1. average (7, 5)

2. divide (12, 2)

3. answer (6) 4. answer (6)

5. store (“6”, file1) Averager(in1,in2) {

return (in1+in2)/2;

}

Averager delegates the division operation to the service Divider

Example Application

GUI Averager Divider

Store

1. average (7, 5)

2. divide (12, 2)

3. answer (6) 4. answer (6)

5. store (“6”, file1)

Relationships• 12 in msg 2 is sum of 7, 5 in msg 1• 6 in msg 3 is division of 12, 2 in msg 2• 6 in msg 4 is copy of 6 in msg 3• 6 in msg 4 is average of 7, 5 in msg 1• 6 in msg 6 is copy of 6 in msg 4

Tracers• are used to demarcate activities (aka sets of services)• added by Averager in call to Divider• returned by Divider in response

The data we want to find the provenance of

Identify the event where the entity is documented:

In this case, the event is the receipt of a request to store the data in file named file1

Identify the data entity within that message In this case, the data of interest is the “6”

stored in file1

“file1” Store

Provenance Graph

“6” StoreGUI

“6” GUIAverager

“6” AveragerDivider

“12” DividerAverager “2” DividerAverager

“5” AveragerGUI“7” AveragerGUI

Copy of

Copy of

Division of

DividendDivisor

Sum of

Average of

Scoped Provenance Graph

“6” StoreGUI

“6” GUIAverager

“6” AveragerDivider

“12” DividerAverager “2” DividerAverager

“5” AveragerGUI“7” AveragerGUI

Copy of

Copy of

Division of

DividendDivisor

Sum of

Average of

Filter to exclude“Average of”relationships

Allows us to ignore the high level structure of the computation and to focus on the actual operations

e.g. allows us to establish what a given provider actually does

Scoped Provenance Graph

“6” StoreGUI

“6” GUIAverager

“6” AveragerDivider

“12” DividerAverager “2” DividerAverager

“5” AveragerGUI“7” AveragerGUI

Copy of

Copy of

Division of

DividendDivisor

Sum of

Average of

Filter to excludemessages containingtracer

This is equivalent tohiding the internaloperation of Averager

Allows us to consider a given service (and all its inferior invocations) as a black box: high level account of provenance

e.g. no detail should be provided about the internals of Averager

Scoped Provenance Graph

“6” StoreGUI

“6” GUIAverager

“6” AveragerDivider

“12” DividerAverager “2” DividerAverager

“5” AveragerGUI“7” AveragerGUI

Copy of

Copy of

Division of

DividendDivisor

Sum of

Average of

Filter to excludeDivisor parameters

Allows us to scope the provenance graph according to types of data or operations

e.g. looking at the restorations of a painting rather than its various owners

Provenance Query

Practically …

Event and Data Identification//ps:interactionRecord

[ps:interactionKey/ps:messageSink/wsa:EndpointReference/

wsa:Address="http://www.example.com/store"]

The interaction record in which the receiver (messageSink) hasaddress http://www.example.com/store

//ps:interactionPAssertion [ex:envelope/ex:store/ex:location="/home/sm/data/file1"]

//ex:envelope/ex:store/ex:data

Event identification

Data identification

Practically …

The scope of the provenance query

Unscoped query/

Exclude ‘averageOf’ relation/pq:relationshipTarget[ps:relation!="http://www.example.com#averageOf"]

Exclude tracer introduced by Averager/pq:relationshipTarget/ps:interactionPAssertion[not(ex:envelope/ph:pheader/ph:interactionMetaData[ph:tracer="process://sub/1"])]

Provenance of Donor Diagnosis Request

DiagnoseRequest Decision Maker

Donor DataCollector

DiagnoseRequest Donor Data

CollectorBrain Death

Manager

Brain DeathManager

Testing Lab

DataCollectionComplete Brain Death

ManagerHealthcare

Record Manager

Patient(in Brain DeathNotification) Brain Death

ManagerUser Interface

Was Caused By

Is Diagnosis Request For

Patient

TestResults

Test ResultsDonor Data Collection

Data CollectionRequest Healthcare

Record ManagerDonor Data

Collector

EHCR HealthcareRecord Manager

EHCRS

EHCRRequest

EHCRSHealthcare

Record Manager

Includes Data

Is Response To

Was Caused By

Is Response To

Conclusions

ProvenanceStore

Reco

rd

To Sum Up

Query

Compliance check Rerun/Reproduce Analyse

Standardising thedocumentation of

Business Processes

Provenance Architecture Methodology

Apply

Healthcare

DistributionFinance

Aerospace

Automobile

Pharmaceutical

Slide from John Ibbotson

Conclusions

Crucial topic for many applications Full architectural specification An implementation available for

download Methodology to make application

provenance-aware www.pasoa.org www.gridprovenance.org

twiki.ipaw.info

Provenance Challenge

Publications1. Paul Groth, Simon Miles, Weijian Fang, Sylvia C. Wong, Klaus-Peter

Zauner, and Luc Moreau. Recording and Using Provenance in a Protein Compressibility Experiment. In Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing (HPDC'05), July 2005.

2. Paul Groth, Michael Luck, and Luc Moreau. A protocol for recording provenance in service-oriented Grids. In Proceedings of the 8th International Conference on Principles of Distributed Systems (OPODIS'04), Grenoble, France, December 2004.

3. Paul Groth, Michael Luck, and Luc Moreau. Formalising a protocol for recording provenance in Grids. In Proceedings of the UK OST e-Science second All Hands Meeting 2004 (AHM'04), Nottingham, UK, September 2004.

4. Simon Miles, Paul Groth, Miguel Branco, and Luc Moreau. The requirements of recording and using provenance in e-Science experiments. Technical report, University of Southampton, 2005.

5. Luc Moreau, Syd Chapman, Andreas Schreiber, Rolf Hempel, Omer Rana, Lazslo Varga, Ulises Cortes, and Steven Willmott. Provenance-based Trust for Grid Computing --- Position Paper. In , 2003.

6. Paul Townend, Paul Groth, and Jie Xu. A Provenance-Aware Weighted Fault Tolerance Scheme for Service-Based Applications. In Proc. of the 8th IEEE International Symposium on Object-oriented Real-time distributed Computing (ISORC 2005), May 2005.

7. Paul Groth, Simon Miles, Victor Tan, and Luc Moreau. Architecture for Provenance Systems. Technical report, University of Southampton, October 2005.

Questions