Provenance challenge --- my Grid David De Roure University of Southampton Jun Zhao, Carole Goble and Daniele Turi University of Manchester

Provenance challenge --- myGrid

David De Roure, University of Southampton

Jun Zhao, Carole Goble and Daniele Turi, University of Manchester

Outline

• Short team introduction

• Workflow implementation

• Provenance schema and storage

• Provenance queries

• Suggestions

• Reflection

• Acknowledgement

Provenance Challenge Overview

Given an abstract workflow

• Implement this workflow in your system

• Collect provenance from runs of this workflow

• Present the implemented workflow and collected provenance

• Answer a list of provenance questions and present these answers

Taverna and myGrid

• A UK e-Science project to build middleware for in silico experiments by individual life scientists, stuck in under-resourced labs, who use other people’s applications.

• Sequence analysis, microarray analysis, proteomics, chemoinformatics, image processing, rendering Dilbert cartoons.

12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt

Scufl

• Data links

• Control links: limited support

• Failure tolerance: retry and alternative services

• Implicit iterations: cross/dot iterations

• Nested workflows

• Semantic metadata annotations
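
The cross/dot iteration strategies listed above can be illustrated with a small Python sketch. It assumes that "dot" pairs list inputs by position and "cross" runs the process once per combination of inputs; the function names and the `align` stand-in are hypothetical and are not part of Scufl or Taverna.

```python
from itertools import product


def dot_iterate(func, xs, ys):
    """Dot strategy (assumed): pair inputs by position, one call per pair."""
    return [func(x, y) for x, y in zip(xs, ys)]


def cross_iterate(func, xs, ys):
    """Cross strategy (assumed): one call per combination of inputs."""
    return [func(x, y) for x, y in product(xs, ys)]


def align(image, reference):
    # Hypothetical stand-in for a process that accepts single values only.
    return f"{image}->{reference}"


images = ["img1", "img2"]
references = ["refA", "refB"]

dot_result = dot_iterate(align, images, references)      # 2 invocations
cross_result = cross_iterate(align, images, references)  # 4 invocations
```

The process itself is written for single values; the iteration strategy decides how many invocations a pair of list inputs produces, which is why the iteration is called implicit.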

What has to be done

• Design the workflow using Scufl in Taverna

• Build services (Web services, Soaplab services, local java, or beanshell scripts) to implement each process

• Gather and process the real data products

Doing it properly

• Wrap each procedure as a service

• Process the real data as a real experiment

• Use iterations, nested workflows or interactive workflows supported by Taverna

• Real examples:

– Chimatica (http://www.chimatica.co.uk/) supports high throughput workflows using Taverna 1.X

– MIAS-Grid (http://www.mias-irc.net/) uses myGrid to build medical image processing workflows

What we actually did

• Realize each procedure as a beanshell script, to avoid real service implementation and deployment

• Pass pseudo data products rather than real image data products

• But keep the metadata about data products along with provenance to answer semantic questions
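
A rough Python sketch of this shortcut, under stated assumptions: each procedure is a stub that passes along a placeholder product instead of image data, while metadata and a provenance record are still kept. The actual implementation used beanshell scripts inside Taverna; every name and field below is hypothetical.

```python
def make_product(name, **metadata):
    """A pseudo data product: a name plus metadata, with no image payload."""
    return {"name": name, "metadata": dict(metadata)}


def stub_procedure(process_name, inputs, output_name):
    """Stand-in for a real service: emits a placeholder output and a record
    of which inputs it was derived from."""
    output = make_product(output_name)
    record = {
        "process": process_name,
        "inputs": [p["name"] for p in inputs],
        "output": output_name,
    }
    return output, record


# Hypothetical inputs; the metadata keys echo annotations used in the queries.
header = make_product("anatomy_header_1", global_maximum=4095, center="UChicago")
image = make_product("anatomy_image_1")

warp, warp_record = stub_procedure("align_warp", [image, header], "warp_params_1")
```

Because the metadata travels with the placeholder products, queries over annotations such as `center` can still be answered even though no real images were processed.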

Implemented Scufl workflow in Taverna

Provenance schema

• Four aspects

– Workflow provenance

– Data provenance

– Organization provenance

– Knowledge provenance

• Provenance ontology

– RDFS

– OWL-lite

Provenance Pyramid Model

[Figure: pyramid with four levels, listed as Knowledge Level, Organization Level, Data Level and Workflow Level. Example nodes shown: serviceInvocation1, serviceInvocation2, data1, data2, data3, data4, WSDL, GenomicProject, similarData]

[Diagram: overview of the provenance ontology, divided into Organisation provenance, Workflow provenance and Data/knowledge provenance]

Classes: Workflow, Experimenter, Organisation, Workflow run, Process, ProcessRun, ProcessIteration, Data, Atomic Data, Data Collection, Knowledge statements

Properties: runsWorkflow, launchedBy, belongsTo, hasInput, executesProcessRun, iteration, workflowOutput, derivedFrom, createdBy, containsData, isA, runsProcess, hasProcesses

Examples shown: BLAST @ NCBI; web service invocation of BLAST @ NCBI; similar_sequence_to

Workflow provenance ontology

Data provenance ontology

Organization & Knowledge provenance ontology

• userPredicate

– Semantic concept about a data product or a service, e.g. nucleotide_sequence

– Semantic (knowledge) relationships between two data products, e.g. similar_sequence_to

Collected & stored provenance

• LSIDs used to identify:

– data, workflows, workflow runs

– LSIDs are names of graphs

• Named RDF graphs

– retrieve whole workflow runs

– implementations:

  • Sesame2 native store: scalable, but an alpha release (bugs)

  • NG4J (Jena + MySQL): scalability issues

• Future implementations: Oracle and Boca
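
The named-graph idea can be sketched in a few lines of Python: each workflow run gets its own graph, the graph is named by the run's LSID, and fetching that graph returns the whole run. This is a toy in-memory stand-in, not the Sesame2 or NG4J API. The `example.org` LSIDs are made up, and the direction of the `launchedBy` and `executesProcessRun` properties is a guess.

```python
from collections import defaultdict


class NamedGraphStore:
    """Toy store: each graph name maps to a set of (subject, predicate, object)."""

    def __init__(self):
        self._graphs = defaultdict(set)

    def add(self, graph_name, subject, predicate, obj):
        self._graphs[graph_name].add((subject, predicate, obj))

    def graph(self, graph_name):
        # Return a copy so callers cannot mutate the stored graph.
        return set(self._graphs.get(graph_name, ()))


# Hypothetical LSIDs (urn:lsid:<authority>:<namespace>:<object>).
RUN_1 = "urn:lsid:example.org:wfrun:1"
RUN_2 = "urn:lsid:example.org:wfrun:2"

store = NamedGraphStore()
store.add(RUN_1, RUN_1, "launchedBy", "urn:lsid:example.org:person:alice")
store.add(RUN_1, RUN_1, "executesProcessRun",
          "urn:lsid:example.org:processrun:align_warp_1")
store.add(RUN_2, RUN_2, "launchedBy", "urn:lsid:example.org:person:bob")

# Retrieving one named graph yields everything recorded for that run.
run_1_triples = store.graph(RUN_1)
```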

Answer matrix

1. Find the process that led to d0 (Atlas X Graphic)

2. Find the process that led to d0 (Atlas X Graphic), excluding everything prior to d1 (the averaging of images with softmean)

3. Find the Stage 3, 4 and 5 details of the process that led to d0 (Atlas X Graphic)

4. Find all invocations of procedure align_warp using p0 (a twelfth order nonlinear 1365 parameter model)

5. Find all Atlas Graphic images outputted from workflows where at least one of the input Anatomy Headers had an entry global maximum=4095

– i.e. find all the d0 that are derived from d1 where value(d1) = 4095

6. Find all output averaged images of softmean, where the warped images taken as input were align_warped using a twelfth order nonlinear 1365 parameter model

– i.e. find all the d0 that are derived from d1 where derivedFrom(d1) = d2

Query categories: Process provenance, Data provenance
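
Queries 1 and 2 amount to walking derivedFrom links backwards from d0. The Python sketch below is an illustration only: the slides do not say how the queries were implemented against the RDF store, and the product names are hypothetical.

```python
def ancestors(item, derived_from, stop_at=frozenset()):
    """Collect everything `item` was transitively derived from.

    Items in `stop_at` are reported but not expanded, which models
    'excluding everything prior to d1'.
    """
    seen = set()
    stack = [item]
    while stack:
        current = stack.pop()
        if current in stop_at:
            continue  # do not look further back than this item
        for parent in derived_from.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical derivedFrom links for one run (child -> parents).
derived_from = {
    "atlas_x_graphic": ["atlas_x_slice"],
    "atlas_x_slice": ["atlas_image"],  # atlas_image: output of softmean
    "atlas_image": ["resliced_1", "resliced_2"],
    "resliced_1": ["warp_1"],
    "resliced_2": ["warp_2"],
}

full_history = ancestors("atlas_x_graphic", derived_from)              # query 1
cut_history = ancestors("atlas_x_graphic", derived_from,
                        stop_at={"atlas_image"})                       # query 2
```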

Answer matrix

7. A user has run the workflow twice, in the second instance replacing each procedure (convert) in the final stage with two procedures: pgmtoppm, then pnmtojpeg. Find the differences between the two workflow runs.

8. Find the outputs of align_warp where the inputs are annotated with center=UChicago.

9. Find all the graphical atlas sets that have metadata annotation studyModality with values speech, visual or audio, and return all other annotations to these files.

Query categories: Provenance across runs, Knowledge provenance
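
For query 7, one simple way to compare two runs is to diff the sets of procedures each run invoked. This is a sketch under that assumption, not the method used in the provenance browser; the run contents below list only procedures named in the queries.

```python
def diff_runs(run_a, run_b):
    """Compare two runs by the procedures each one invoked."""
    a, b = set(run_a), set(run_b)
    return {
        "only_in_first": a - b,
        "only_in_second": b - a,
        "common": a & b,
    }


# Hypothetical, abbreviated procedure lists for the two runs.
first_run = ["align_warp", "softmean", "convert"]
second_run = ["align_warp", "softmean", "pgmtoppm", "pnmtojpeg"]

differences = diff_runs(first_run, second_run)
```

A set-based diff ignores ordering and repeated invocations; comparing full run graphs would be needed to capture those.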

Suggested Workflow Variants

Implicit iterations

Suggested Workflow Variants

Nested workflow runs

Suggested Workflow Variants

User interactions

Suggested Queries

• Compare, merge and union provenance from different workflow runs

• Explain why different outputs were produced in repeated workflow runs

• Replay a workflow run

Categorisation of queries

Four levels:

1. queries to support the provenance browser

2. semantic queries

3. integration queries

4. pre-canned queries to support provenance usage scenarios.

Live systems

• Taverna: http://taverna.sourceforge.net

• Provenance plugin and browser beta release: bundled with the Taverna release 1.4.

• Provenance ontology: http://cvs.mygrid.org.uk/cgi-bin/viewcvs.cgi/mygrid/miasgrid/rdf-provenance/etc/ontology/

• System requirements:

– Windows, Linux, Mac

– Java 5.0

– MySQL database (optional)

Reflection

• A systematic provenance query framework is needed

• Separate data and provenance metadata

– Better storage scalability

– Avoid archiving duplicate data products
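
One possible reading of this point, sketched in Python: data products live in a store keyed by a content hash, and provenance records refer to them by key, so a product repeated across runs is archived once. Content hashing is a technique substituted here for illustration; the slides do not specify a mechanism, and all names are hypothetical.

```python
import hashlib


class DataStore:
    """Stores each distinct data product once, keyed by its SHA-256 digest."""

    def __init__(self):
        self._blobs = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha256(content).hexdigest()
        self._blobs.setdefault(key, content)  # identical content is kept once
        return key

    def __len__(self):
        return len(self._blobs)


store = DataStore()
provenance = []  # provenance metadata, kept apart from the data itself


def record_output(run_id, process, content):
    """Archive the product and log a provenance record that points to it."""
    key = store.put(content)
    provenance.append({"run": run_id, "process": process, "output": key})
    return key


# Two runs produce the same bytes: two provenance records, one stored product.
key_1 = record_output("run-1", "softmean", b"pseudo atlas image")
key_2 = record_output("run-2", "softmean", b"pseudo atlas image")
```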

• A consensus of provenance models

Acknowledgement

• The myGrid Taverna team: Tom Oinn, Stuart Owen, Stian Soiland, David Withers, Katy Wolstencroft and June Finch

• Daniele Turi: provenance plugin

• Matthew Gamble: Taverna provenance browser

• Chris Wroe from the original myGrid project