session talk @ agu09


DESCRIPTION

Presentation at the AGU'09 Fall Meeting, San Francisco, CA, Dec. 2009

TRANSCRIPT

Page 1: Session talk @ AGU09

AGU Fall meeting, San Francisco, Dec. 2009 - P.Missier

Paolo Missier Information Management Group School of Computer Science, University of Manchester, UK

with additional material by Sean Bechhofer and Matthew Gamble, e-Labs design group, University of Manchester

Scientific Workflow Management System

Towards systematic information exchange and reuse in e-laboratories

AGU Fall Meeting, Dec. 2009

Janus Provenance

Page 4: Session talk @ AGU09

Momentum on sharing and collaboration

• timeliness requires rapid sharing
• repurposing
• the Human Genome project use case

Special issue of Nature on Data Sharing (Sept. 2009): http://www.nature.com/news/specials/datasharing/index.html

• Debate is much further along in Earth Sciences
  – ESIP: data preservation / stewardship, 2009
  – Long established in some communities: Atmospheric sciences, 1998 [1]
• Science Commons recommendations for Open Science
  – (July 2008) [link]

[1] Strebel DE, Landis DR, Huemmrich KF, Newcomer JA, Meeson BW: The FIFE Data Publication Experiment. Journal of the Atmospheric Sciences 1998, 55:1277-1283

Page 10: Session talk @ AGU09

Collaboration in workflow-based science

[Diagram labels: workflow + input dataset specification; workflow execution; outcome (data); outcome (provenance); Packaging; Research Object; browse, query, unbundle, reuse; Paul; data-mediated implicit collaboration]

Page 14: Session talk @ AGU09

Collaboration in workflow-based science

[Diagram labels: outcome (data); outcome (provenance); Packaging; Research Object; browse, query, unbundle, reuse; Paul; data-mediated implicit collaboration; markers ① and ②]

What is needed for Paul to make sense of third party data?

Page 20: Session talk @ AGU09

[Diagram labels: QTL; Paul’s Pack; Paul’s Research Object; Workflow 16; Workflow 13; Common pathways; Results; Logs; Results; Metadata; Paper; Slides.
Relation labels: Feeds into; produces; Included in; Published in.
Other labels: Representation; Domain Relations; Aggregation.]

Page 21: Session talk @ AGU09

ORE: representing generic aggregations

Resource Map (descriptor)

Data structure

http://www.openarchives.org/ore/1.0/primer.html section 4

A. Pepe, M. Mayernik, C.L. Borgman, and H.V. Sompel, "From Artifacts to Aggregations: Modeling Scientific Life Cycles on the Semantic Web," Journal of the American Society for Information Science and Technology (JASIST), to appear, 2009.
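
As a rough illustration of the aggregation idea (not taken from the slides), a Resource Map can be read as a descriptor that describes an Aggregation, which in turn aggregates resources. The sketch below models this with plain Python tuples. The `ore:describes` and `ore:aggregates` terms come from the ORE vocabulary; every `example.org` URI and the helper name `aggregated_resources` are hypothetical.

```python
# Minimal sketch of an ORE-style aggregation as subject-predicate-object triples.
# All example.org URIs are hypothetical placeholders.
REM = "http://example.org/ro/pauls-pack/rem"   # Resource Map (descriptor)
AGG = "http://example.org/ro/pauls-pack"       # the Aggregation it describes

triples = [
    (REM, "ore:describes", AGG),
    (AGG, "ore:aggregates", "http://example.org/workflows/16"),
    (AGG, "ore:aggregates", "http://example.org/results/1"),
    (AGG, "ore:aggregates", "http://example.org/papers/1"),
]


def aggregated_resources(resource_map, triples):
    """Return the resources aggregated by the Aggregation that the map describes."""
    aggregations = {
        o for s, p, o in triples if s == resource_map and p == "ore:describes"
    }
    return sorted(
        o for s, p, o in triples if s in aggregations and p == "ore:aggregates"
    )


print(aggregated_resources(REM, triples))
```

Keeping the descriptor and the aggregation as separate identifiers mirrors the distinction between the Resource Map and the thing it describes.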

Page 25: Session talk @ AGU09

Content: Workflow provenance

A detailed trace of workflow execution:
- tasks performed, data transformations
- inputs used, outputs produced

[Workflow diagram labels: lister; gene_id; get pathways by genes1; merge pathways; concat gene pathway ids; pathway_genes; output]
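
One way to picture such a trace, as a minimal sketch only: wrap each task invocation so that it appends an event listing the task, its inputs, and its output. The task names echo the slide's diagram, but the functions, the lookup table, and the event layout are hypothetical stand-ins, not the format used by any workflow system.

```python
# Minimal sketch: record one provenance event per task invocation.
# The task functions are toy stand-ins for real workflow processors.
trace = []


def run_task(name, func, **inputs):
    """Run func on the inputs and append an event describing what happened."""
    output = func(**inputs)
    trace.append({"task": name, "inputs": dict(inputs), "output": output})
    return output


def get_pathways_by_genes(gene_id):
    # Hypothetical lookup table standing in for a remote service call.
    table = {"g1": ["p1", "p2"], "g2": ["p2"]}
    return table[gene_id]


def merge_pathways(pathway_lists):
    # Flatten the per-gene lists and drop duplicates.
    return sorted({p for plist in pathway_lists for p in plist})


per_gene = [
    run_task("get pathways by genes", get_pathways_by_genes, gene_id=g)
    for g in ["g1", "g2"]
]
merged = run_task("merge pathways", merge_pathways, pathway_lists=per_gene)

for event in trace:
    print(event["task"], event["inputs"], "->", event["output"])
```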

Page 26: Session talk @ AGU09

Why provenance matters, if done right

• To establish quality, relevance, trust

• To track information attribution through complex transformations

• To describe one’s experiment to others, for understanding / reuse

• To provide evidence in support of scientific claims

• To enable post hoc process analysis for improvement, re-design

The W3C Incubator on Provenance has been collecting numerous use cases: http://www.w3.org/2005/Incubator/prov/wiki/Use_Cases#

Page 27: Session talk @ AGU09

What users expect to learn

• Causal relations:
  - which pathways come from which genes?
  - which processes contributed to producing an image?
  - which process(es) caused data to be incorrect?
  - which data caused a process to fail?

• Process and data analytics:
  – analyze variations in output vs an input parameter sweep (multiple process runs)
  – how often has my favourite service been executed? on what inputs?
  – who produced this data?
  – how often does this pathway turn up when the input genes range over a certain set S?

[Workflow diagram labels: lister; gene_id; get pathways by genes1; merge pathways; concat gene pathway ids; pathway_genes; output]
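
The two kinds of question above can be sketched against a toy provenance store. This is an illustration under stated assumptions: the record layout, the identifiers such as `gene:g1`, and the helper `lineage` are all hypothetical, and a real provenance database would expose its own query language.

```python
from collections import Counter

# Hypothetical provenance records: each maps one task run to its inputs and output.
records = [
    {"task": "get pathways by genes", "inputs": ["gene:g1"], "output": "pathway:p1"},
    {"task": "get pathways by genes", "inputs": ["gene:g1"], "output": "pathway:p2"},
    {"task": "get pathways by genes", "inputs": ["gene:g2"], "output": "pathway:p2"},
    {"task": "merge pathways", "inputs": ["pathway:p1", "pathway:p2"],
     "output": "list:merged"},
]


def lineage(item, records):
    """All upstream items that `item` causally depends on (transitive closure)."""
    seen, stack = set(), [item]
    while stack:
        current = stack.pop()
        for record in records:
            if record["output"] == current:
                for source in record["inputs"]:
                    if source not in seen:
                        seen.add(source)
                        stack.append(source)
    return seen


# Causal query: which genes does pathway p2 come from?
genes_for_p2 = sorted(
    i for i in lineage("pathway:p2", records) if i.startswith("gene:")
)

# Analytics query: how often has each task been executed?
runs_per_task = Counter(record["task"] for record in records)

print(genes_for_p2)
print(runs_per_task)
```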

Page 28: Session talk @ AGU09

Open Provenance Model

• graph of causal dependencies involving data and processors
• not necessarily generated by a workflow!
• v1.1 out soon

Goal: standardize causal dependencies to enable provenance metadata exchange

[Diagram legend: wasGeneratedBy(R) between an artifact A and a process P; used(R) between a process P and an artifact A.
Example graph labels: artifacts A1, A2, A3, A4; processes P1, P2, P3; edges wgb(R1), wgb(R2), used(R3), used(R4), wgb(R5), wgb(R6)]
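
A minimal sketch of how such a graph might be held in memory, assuming edges point from effect to cause and carry a role. The node and role names reuse the slide's labels, but the wiring between them is illustrative only, and the helper `candidate_derivations` is hypothetical. Note that pairing `used` with `wasGeneratedBy` yields only candidate data dependencies; it does not prove that an output was actually derived from a given input.

```python
# Sketch of an OPM-style graph: edges point from effect to cause.
# Node and role names mirror the slide's labels; the wiring is illustrative only.
used = [               # (process, artifact, role): process used artifact
    ("P3", "A1", "R3"),
    ("P3", "A2", "R4"),
]
was_generated_by = [   # (artifact, process, role): artifact wasGeneratedBy process
    ("A3", "P3", "R1"),
    ("A4", "P3", "R2"),
    ("A1", "P1", "R5"),
    ("A2", "P2", "R6"),
]


def candidate_derivations(used, was_generated_by):
    """Pairs (output, input) where output was generated by a process that used input."""
    return sorted(
        (a_out, a_in)
        for a_out, p_gen, _role_out in was_generated_by
        for p_use, a_in, _role_in in used
        if p_gen == p_use
    )


print(candidate_derivations(used, was_generated_by))
```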

Page 30: Session talk @ AGU09

Additional requirements on OPM

• Artifact values require a uniform common identifier scheme
  – Linked Data in OPM?

• OPM accounts for structural causal relationships
  – additional domain-specific knowledge required
  – attaching semantic annotations to OPM graph nodes

• OPM graphs can grow very large
  – reduce size by exporting only query results

• Taverna approach
  – multiple levels of abstraction
    • through OPM accounts (“points of view”)

Page 31: Session talk @ AGU09

Query results as OPM graphs

[Diagram labels: W; run W; prov(W); execute query Q; Q(prov(W)); export Q(prov(W)); OPM(Q(prov(W))); export; prov(WA)]

- Approach implemented in the Taverna 2.1 workflow system (just released!)

- Internal provenance DB with ad hoc query language
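
The idea of exporting only a query result, rather than the whole provenance graph, can be sketched as follows. Everything here is an assumption made for illustration: the edge list `prov_w`, the lineage query standing in for Q, and the dictionary layout produced by `export_opm`. None of it reflects Taverna's internal database or its query language.

```python
# Sketch: answer a lineage query Q over prov(W), then export only the matching edges.
# prov_w is a hypothetical edge list of (effect, relation, cause) triples.
prov_w = [
    ("out", "wasGeneratedBy", "merge"),
    ("merge", "used", "paths1"),
    ("paths1", "wasGeneratedBy", "lookup"),
    ("lookup", "used", "gene1"),
    ("other_out", "wasGeneratedBy", "other_task"),
]


def query_lineage(target, edges):
    """Q(prov(W)): edges reachable from target by following effect -> cause."""
    result, frontier, visited = [], [target], set()
    while frontier:
        node = frontier.pop()
        if node in visited:
            continue
        visited.add(node)
        for edge in edges:
            if edge[0] == node:
                result.append(edge)
                frontier.append(edge[2])
    return result


def export_opm(edges):
    """Package a set of edges as a small, self-contained graph document."""
    nodes = sorted({n for effect, _rel, cause in edges for n in (effect, cause)})
    return {
        "nodes": nodes,
        "edges": [
            {"effect": e, "relation": r, "cause": c} for e, r, c in edges
        ],
    }


opm_q = export_opm(query_lineage("out", prov_w))
print(opm_q["nodes"])
```

The exported graph leaves out the unrelated `other_out` branch, which is the size reduction the slide refers to.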

Page 35: Session talk @ AGU09

Full-fledged data-mediated collaborations

[Diagram labels: exp. A: workflow A + input A; result datasets A; Research Object A; result provenance A. Link: result A → input B. exp. B: workflow B + input B; result datasets B; Research Object B; result provenance B]

Page 38: Session talk @ AGU09

Full-fledged data-mediated collaborations

[Diagram labels: workflow A + input A; result datasets A; result A → input B; workflow B + input B; result datasets B; Research Object A+B; result provenance A + B]

Provenance composition accounts for implicit collaboration

Aligned with the focus of the upcoming Provenance Challenge 4: “connect my provenance to yours" into a whole OPM provenance graph.
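
A small sketch of what composing two provenance graphs could look like, assuming each graph is a set of (effect, cause) edges and that result A carries the same identifier when it reappears as input B. The identifiers and the helpers `compose` and `upstream` are hypothetical.

```python
# Sketch: compose provenance graphs from experiments A and B.
# Edges are (effect, cause) pairs with hypothetical identifiers. Composition
# works here only because result A and input B share a single identifier.
prov_a = {("result_A", "workflow_A"), ("workflow_A", "input_A")}
prov_b = {("result_B", "workflow_B"), ("workflow_B", "result_A")}  # result A -> input B


def compose(*graphs):
    """Union of edge sets; shared node identifiers stitch the graphs together."""
    return set().union(*graphs)


def upstream(node, edges):
    """Everything that node causally depends on, directly or indirectly."""
    seen, stack = set(), [node]
    while stack:
        current = stack.pop()
        for effect, cause in edges:
            if effect == current and cause not in seen:
                seen.add(cause)
                stack.append(cause)
    return seen


prov_ab = compose(prov_a, prov_b)
print(sorted(upstream("result_B", prov_b)))
print(sorted(upstream("result_B", prov_ab)))
```

With B's graph alone, the lineage of result B stops at result A; in the composed graph it reaches back to workflow A and input A.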

Page 39: Session talk @ AGU09

Contacts

The myGrid Consortium (Manchester, Southampton)

Janus Provenance

http://www.myexperiment.org

http://mygrid.org.uk

Me: [email protected]