Invited talk: Symposium on Provenance in Scientific Workflows, Salt Lake City, Oct. 2008

Granular workflow provenance in Taverna. Paolo Missier, Information Management Group, School of Computer Science, University of Manchester, UK. Symposium on Provenance in Scientific Workflows, Salt Lake City, Oct. 2008

Upload: paolo-missier

Post on 11-May-2015


TRANSCRIPT

Page 1: Invited talk: Symposium on Provenance in Scientific Workflows Salt Lake City, Oct. 2008

Granular workflow provenance in Taverna

Paolo Missier
Information Management Group
School of Computer Science, University of Manchester, UK

Symposium on Provenance in Scientific Workflows
Salt Lake City, Oct. 2008

Page 2:

Outline

• Collection values in [bioinformatics] workflows are important
• Granular provenance over collections: model and issues
• Measuring “provenance friendliness” of dataflows
• Increasing friendliness of existing dataflows
• Extending the Open Provenance Model graph to describe granular data derivations
• Provenance service architecture - brief description

Page 3:

IPAW'08 – Salt Lake City, Utah, June 2008

Example (Taverna) dataflow

QTL -> genes -> KEGG pathways

Page 10:

Collections example: from genes to SNPs

• gene -> genomic region
• extend region
• retrieve SNPs in the region
• rearrange SNP details

• See myexperiment.org: http://www.myexperiment.org/workflows/166

Input (list of gene IDs): [ ENSG00000139618 , ENSG00000083093 ]

Output (one list of SNP records per gene): [ [<1,23554512,16,rs45585833>, <1,23554712,16,rs45594034>,...], [<1,31820153,13,ENSSNP10730823>, <1,31818497,13,ENSSNP10730820>,...] ]
Page 11:

Computational model for collections

Depth mismatch between declared / offered type:

type(P4:X1) = s but type(a) = list(s)

type(P4:X2) = type(c) = list(s)

type(P4:X3) = s but type(c) = list(s)

Execution at P4:

Y = (map P1 <(a ⊗ b) , c>) // cross product

Y = [ (P1 <a1,b1,c>) ... (P1 <an,bm,c>) ]
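A minimal Python sketch of the execution rule above. It assumes that a port declared as `s` but offered a `list(s)` is iterated over, while a port whose declared depth matches the offered value receives it whole. The names `cross_product_iterate` and `describe` are illustrative and are not Taverna code. The result here is nested by iteration index, whereas the slide writes it flat.

```python
def cross_product_iterate(proc, a, b, c):
    """Call proc once per pair (a_i, b_j); the list c is passed whole to every call."""
    return [[proc(a_i, b_j, c) for b_j in b] for a_i in a]


def describe(x1, x2, x3):
    """Hypothetical stand-in for the processor: record the inputs of one call."""
    return (x1, x2, tuple(x3))


y = cross_product_iterate(describe, ["a1", "a2"], ["b1", "b2", "b3"], ["c1", "c2"])
print(y[1][2])  # the call made for the pair (a2, b3)
```

Keeping the nesting means the position of each output element records which `a` and `b` elements produced it, which is what a granular lineage query needs.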

Page 23:

Collections and iterations

Processor signatures: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s

Values on the dataflow links:

[139618, 83093]
[139618, 83093]
[23520984, 31786617] [16,13]
[16,13] [23560179, 31871809]

Dot product:

<16, 23560179,..> → [ <1,23553692,16,rs152451>,...]
<13, 31871809,...> → [<1,31840948,13,rs169546>,...]

Iteration over individual elements: 139618, 83093
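The dot-product iteration on this slide can be sketched in Python as follows. This is only an illustration: `dot_product_iterate` and `snps_in_region` are hypothetical names, and the SNP records returned are placeholders, not real query results.

```python
def dot_product_iterate(proc, xs, ys):
    """Pair xs[i] with ys[i] and call proc once per pair (no cross combinations)."""
    if len(xs) != len(ys):
        raise ValueError("dot product needs lists of equal length")
    return [proc(x, y) for x, y in zip(xs, ys)]


def snps_in_region(chromosome, position):
    """Hypothetical s -> l(s) processor: returns placeholder SNP records."""
    return [(chromosome, position, f"snp-{k}") for k in range(2)]


result = dot_product_iterate(snps_in_region, [16, 13], [23560179, 31871809])
print(result[1][0])  # first record of the sub-list produced from <13, 31871809>
```

Because each call returns a list, the overall result is a list of lists, and sub-list `i` depends only on element `i` of each input list.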

Page 24:

Tracing granular lineage

• Provenance traces are most useful when they are granular
– trace individual items in a collection
– “which geneID is responsible for the presence of SNP rs169546 in the output?”
• Curse of black box processors:
– M-M (many-many) and M-1 (many-one) processors destroy granularity

Page 26:

Granular lineage I: no loss of precision

(Figure: P0 with inputs X1, X2 and outputs Y1:l(s), Y2:l(s); P1 with X:s and Y:s; P2 with X:s and Y:s; P3 with X1:s, X2:s and output Y, labelled "Cross product")

P1 ≡ λ X . X^2
P2 ≡ λ X . 2X
P3 ≡ λ X1 . λ X2 . X1 + X2

Let P0:Y1 = [a_1 ... a_n], P0:Y2 = [b_1 ... b_m]

Then P1:Y = [a_1^2 ... a_n^2], P2:Y = [2b_1 ... 2b_m], P3:Y = [a_1^2 + 2b_1 ... a_n^2 + 2b_m]

And lineage(P3:Y[i], {P0}) = { P0:Y1[i], P0:Y2[j] }

Values on the dataflow links:

[a_1 ... a_i ... a_n] [b_1 ... b_i ... b_m]
[a_1^2 ... a_i^2 ... a_n^2] [2b_1 ... 2b_j ... 2b_m]
[a_1^2 + 2b_1 ... a_i^2 + 2b_i ... a_n^2 + 2b_m]
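A small Python sketch of this case, under the assumption that P1 and P2 are applied element by element and that P3 combines their outputs by cross product, so an output element is addressed by a pair of positions (i, j). Indices are 0-based here, and the function names are illustrative.

```python
def run_workflow(a, b):
    """Evaluate P1, P2 and P3 over the two lists produced by P0."""
    p1 = [x ** 2 for x in a]                 # P1 = X^2, iterated over P0:Y1
    p2 = [2 * x for x in b]                  # P2 = 2X, iterated over P0:Y2
    return [[u + v for v in p2] for u in p1]  # P3 = X1 + X2 under cross product


def lineage_of_p3(i, j):
    """Elements of P0's outputs that P3:Y[i][j] was computed from."""
    return {("P0:Y1", i), ("P0:Y2", j)}


p3 = run_workflow([1, 2], [3, 4, 5])
print(p3[1][2], lineage_of_p3(1, 2))
```

Since every processor handles one element per call, the position of an output element is enough to name the exact input elements it came from, with no loss of precision.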

Page 28:

Granular lineage II: loss of precision

(Figure: P0 with inputs X1, X2 and outputs Y1, Y2; P1 with X:s and Y:s; P2 with X: l(s) and Y:s; P3 with X1:s, X2:s and output Y)

P1 ≡ λ X . X^2
P2 ≡ λ X . min X
P3 ≡ λ X1 . λ X2 . X1 + X2

Let P0:Y1 = [a_1 ... a_n], P0:Y2 = [b_1 ... b_m]

Then P1:Y = [a_1^2 ... a_n^2], P2:Y = c = min {b_1 ... b_m}, P3:Y = [a_1^2 + c ... a_n^2 + c]

And lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2 }

Values on the dataflow links:

[a_1 ... a_i ... a_n] [b_1 ... b_i ... b_m]
[a_1^2 ... a_i^2 ... a_n^2] c
[a_1^2 + c ... a_i^2 + c ... a_n^2 + c]

Page 32:

III: recoverable loss of precision

(Figure: P0 with inputs X1, X2 and outputs Y1, Y2; P1 with X:s and Y:s; P2 with X: l(s) and Y:l(s); P3 with X1:s, X2:s and output Y)

P1 ≡ λ X . X^2
P2 ≡ λ X . f X
P3 ≡ λ X1 . λ X2 . X1 + X2

Let P0:Y1 = [a_1 ... a_n], P0:Y2 = [b_1 ... b_m]

Then P1:Y = [a_1^2 ... a_n^2], P2:Y = c = [c_1 ... c_i ... c_m], P3:Y = [a_1^2 + c_1 ... a_i^2 + c_i ... a_m^2 + c_m]

Values on the dataflow links:

[a_1 ... a_i ... a_n] [b_1 ... b_i ... b_m]
[a_1^2 ... a_i^2 ... a_n^2] [c_1 ... c_i ... c_m]
[a_1^2 + c_1 ... a_i^2 + c_i ... a_m^2 + c_m]

Without any annotation on P2:

lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2 }

With the annotation “f is index-preserving” on P2:

lineage(P3:Y[i]) = { P0:Y1[i], P0:Y2[i] }
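The effect of the “index-preserving” annotation can be sketched in Python. This is a simplified illustration, not the Taverna lineage algorithm; the function name and the boolean flag are assumptions made for the example, and indices are 0-based.

```python
def lineage_through_list_processor(i, input_length, index_preserving=False):
    """Input positions that output element i of an l(s) -> l(s) processor may depend on."""
    if index_preserving:
        # The annotation asserts that output[i] depends on input[i] only.
        return {i}
    # Black box: any input element may have contributed to output[i].
    return set(range(input_length))


print(lineage_through_list_processor(2, 4))
print(lineage_through_list_processor(2, 4, index_preserving=True))
```

Without the annotation the query has to report the whole input list; with it, the answer shrinks to a single element.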

Page 33:

Multi-level nesting and lineage precision

Page 39:

Adding annotations to the original workflow

Processor signatures: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s

Values on the dataflow links:

geneIdList: [139618, 83093]
[139618, 83093]
[23520984, 31786617] [16,13]
[16,13] [23560179, 31871809]
CR:result[0,i]: [ <1,23553692,16,rs152451>,...]
CR:result[1,j]: [<1,31840948,13,rs169546>,...]

Without annotations:

lineage(CR:result[0,i]) = { geneIdList }
lineage(CR:result[1,j]) = { geneIdList }

With two processors annotated “f is index-preserving”:

lineage(CR:result[0,i]) = { geneIdList[0] }
lineage(CR:result[1,j]) = { geneIdList[1] }
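A rough Python sketch of how annotations along a chain of processors decide whether `CR:result[k, ...]` can be traced to `geneIdList[k]`. The processor descriptions, the `index_preserving` flag and the function name are assumptions made for illustration; the real lineage query works over the recorded provenance, not over this simplified list.

```python
def trace_top_level_index(result_index, processors):
    """Trace result[k, ...] back to the source list.

    processors: dicts ordered from the output back to the source, each with a
    'signature' key and an optional 'index_preserving' flag.
    Returns k when element k of the source list can be singled out, or None
    when only the whole list can be reported.
    """
    k = result_index[0]
    for proc in processors:
        is_list_to_list = proc["signature"] == "l(s) -> l(s)"
        if is_list_to_list and not proc.get("index_preserving", False):
            return None  # an unannotated black box hides the element-level link
    return k


plain = [{"signature": "s -> l(s)"}, {"signature": "l(s) -> l(s)"}]
annotated = [{"signature": "s -> l(s)"},
             {"signature": "l(s) -> l(s)", "index_preserving": True}]
print(trace_top_level_index((1, 5), plain), trace_top_level_index((1, 5), annotated))
```

A single unannotated list-to-list processor on the path is enough to fall back to the whole `geneIdList`.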

Page 40:

Granular lineage: recap

• Lineage query model accounts for granular traces over nested collections
• arbitrary nesting levels:
– values are trees in general
– lineage query identifies the correct sub-trees
• Lineage queries are efficient
– recursion problem “compiled away” by query rewriting
– (shameless claim - details omitted)
• But:
– One single M-* processor can destroy granularity
– in some cases annotations are a remedy

Page 43:

Towards provenance-friendly workflows

1. Define metrics for workflow provenance precision
– how well is granularity preserved over a lineage trace?
– what is the impact of M-* processors?
– use to prioritize remedial actions

2. Make workflows more provenance friendly:
– Add knowledge (static):
• “lightweight annotations” [MBZ+08] -- see IPAW08
– Add knowledge (dynamic):
• provenance-active workflow processors
– Redesign processors / workflow
• general guidelines, provenance friendly patterns

[MBZ+08] Paolo Missier, Khalid Belhajjame, Jun Zhao, Carole Goble, Data lineage model for Taverna workflows with lightweight annotation requirements, Procs. International Provenance and Annotation Workshop (IPAW 2008)

Page 50:

Lineage precision: example

Values in the example dataflow (figure labels):

a = [a1, a2]
b = [b1, b2]
c = [c1, c2, c3]
d = [d1, d2]
e = [e1, e2]
f

lineage(P4:Y1[1.2.2], {P0, P2, P3}) = { P0:Y[1] = a1, P2:X = c, P3:X = e }

precision = (1 + .5 + .5) / 3 = 2/3

Page 52:

Precision relative to a sub-graph

• Refining the previous idea:
– precision relative to a set O of output variables and a set I of input variables
• because not all variables are equally interesting...
• weights WI, WO account for relative importance of variables

$$\sum_{w_i \in W_I} w_i = \sum_{w_j \in W_O} w_j = 1$$

$$\mathrm{prec}(I, W_I, O, W_O) = \sum_{j:1 \ldots |O|} \left[ W_O(O_j) \sum_{X_i(p_i) \in \mathrm{lin}(O_j, I)} W_I(X_i) \cdot \frac{\mathrm{len}(p_i)}{\mathrm{nl}(X_i)} \right]$$

(Figure: sub-graph with input variables I1, I2 and output variables O1, O2, O3)
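The weighted double sum can be sketched in Python. This is a simplified illustration: the per-variable granularity scores are passed in directly rather than derived from path lengths and nesting levels, and the function and variable names are assumptions. With a single output and uniform input weights of 1/3, it reproduces the earlier example's value of 2/3 from the scores 1, .5 and .5.

```python
from fractions import Fraction


def precision(output_weights, lineage_scores, input_weights):
    """Sum over outputs of W_O(o) times the weighted scores of o's lineage elements."""
    total = Fraction(0)
    for out_var, w_out in output_weights.items():
        inner = sum(
            (input_weights[in_var] * score
             for in_var, score in lineage_scores[out_var].items()),
            Fraction(0),
        )
        total += w_out * inner
    return total


# Scores taken from the earlier example: 1 for P0:Y[1], .5 for P2:X and P3:X.
scores = {"P4:Y1": {"P0:Y": Fraction(1), "P2:X": Fraction(1, 2), "P3:X": Fraction(1, 2)}}
w_in = {"P0:Y": Fraction(1, 3), "P2:X": Fraction(1, 3), "P3:X": Fraction(1, 3)}
w_out = {"P4:Y1": Fraction(1)}
print(precision(w_out, scores, w_in))
```

Exact fractions are used so that the weights summing to 1 can be checked without rounding error.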

Page 53:

Impact of M-* processors on precision

$$\mathrm{reach}(P, v) = \begin{cases} 1 & \text{if } v \text{ is reachable from } P \\ 0 & \text{otherwise} \end{cases}$$

$$\mathrm{impact}(P, O) = \sum_{o \in O} W(o) \cdot \mathrm{reach}(P, o)$$

Count the number of variables in O that can be reached from P
• weighted sum

(Figure: processor P inside a sub-graph with input variables I1, I2 and output variables O1, O2, O3)
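A short Python sketch of the impact measure as a weighted reachability count. The adjacency-list graph, the node names and the weights are hypothetical, since the slide's figure is not recoverable from the transcript.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from start by following edges forward (BFS)."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def impact(graph, processor, output_weights):
    """Weighted count of output variables reachable from the given processor."""
    reach = reachable(graph, processor)
    return sum(w for o, w in output_weights.items() if o in reach)


# Hypothetical dataflow: P feeds O1 and O2, Q feeds O3.
graph = {"I1": ["P"], "I2": ["Q"], "P": ["O1", "O2"], "Q": ["O3"]}
weights = {"O1": 0.5, "O2": 0.25, "O3": 0.25}
print(impact(graph, "P", weights), impact(graph, "Q", weights))
```

Ranking the M-* processors by this value gives one way to decide which of them to annotate or refactor first.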

Page 54:

Improving provenance precision

• Impact used to prioritize user actions on processors
• Precision used to assess improvement
• add index-preserving annotations
✓ illustrated earlier
• refactor M-* processors
• make processors provenance-active

Page 62:

Refactoring M-* → 1-1

Processor signatures: l(s) → l(s); l(s) → l(s); s → s; s → l(s); s → s

Values on the dataflow links:

[139618, 83093]
[139618, 83093]
[23520984, 31786617] [16,13]
[16,13] [23560179, 31871809]

Refactored processor, s → s, applied to one element at a time:

139618 → <16, 23520984>
83093 → <13, 31786617>

Dot product:

<16, 23560179> → [ <1,23553692,16,rs152451>,...]
<13, 31871809> → [<1,31840948,13,rs169546>,...]
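The refactoring idea can be sketched in Python: a processor that takes one gene ID at a time lets the iteration itself record which input produced which output. The lookup table reuses the values shown on the slide purely as sample data; `region_of` and `iterate_with_lineage` are illustrative names, not part of the workflow.

```python
# Sample data copied from the slide; a real processor would query a service.
REGIONS = {139618: (16, 23520984), 83093: (13, 31786617)}


def region_of(gene_id):
    """Refactored s -> s processor: one gene ID in, one <chromosome, position> out."""
    return REGIONS[gene_id]


def iterate_with_lineage(proc, xs):
    """Apply proc element by element and record that output i came from input i."""
    outputs = [proc(x) for x in xs]
    lineage = {i: {i} for i in range(len(xs))}
    return outputs, lineage


outputs, lineage = iterate_with_lineage(region_of, [139618, 83093])
print(outputs[1], lineage[1])
```

With the original l(s) → l(s) signature the same outputs would be produced, but the lineage of each output would have to be reported as the whole input list.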

Page 63:

Provenance-active processors

– Passive processors do not contribute explicit provenance info
– provenance-active processors actively feed metadata to the lineage service

(Figure: processor P with X: l(s) = [a1, a2, a3] and Y: s = b; processor P with X: l(s) = [a1, a2, a3] and Y: l(s) = [b1, b2])

Dynamic annotations:
• aggregation f(): b = f(X[1]...X[k])
• b = X[i]

Static annotations:
• P is index-preserving
• sorting: Y = Π(X)

Page 64:

Open Provenance Model

• A graph notation to represent process provenance
– independent of the provenance producers
– suitable for exchanging provenance across different workflow systems
• State: draft 1.01 (July 2008)

Page 69:

Mapping to OPM - granularity issue

(Figure, left: dataflow with P0 (inputs X1, X2; outputs Y1, Y2), P1 (X:s, Y:s) and P2 (X:s, Y:s), with values a, b, c, d, e, f on the links. Figure, right: the corresponding OPM graph with processes P0, P1, P2, artifacts a, b, c, d, edges labelled "used" and "wgb" (wasGeneratedBy), a wasDerivedFrom edge between artifacts, and a wasDerivedFrom edge between the elements b[p] and d[p’])

How can this granular dependency be described for all arbitrary paths p?

Currently cannot be expressed using OPM

Page 72:

Path mapping rules

(Figure: OPM graph with processes P1, P2, P3, artifacts a, b, c, d, edges labelled "used", "wgb" and "wasDerivedFrom"; the elements b[p] and d[p’] are linked by an edge labelled "actual lineage")

Static graph structure sufficient to provide this (in Taverna)

But this is only known at query time (extensional enumeration not an option)

Observation:
• only need to consider individual processor transformations
• exploit local processor rules for propagating granular lineage

Hint: granularity is only determined by depth of the path

At query time, the Taverna lineage query algorithm encodes a path mapping rule to compute p’ given p
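One plausible form of a local path mapping rule, sketched in Python. It is an assumption made for illustration, not the rule encoded by the Taverna lineage query algorithm: it supposes a processor iterating by cross product, where each input contributes as many leading index components to the output path as its depth mismatch (offered depth minus declared depth).

```python
def map_output_path(p, iteration_depths):
    """Split output path p into one input path p' per input port.

    iteration_depths[k] is the assumed depth mismatch of input k, i.e. how many
    index components that input contributes to the output path.
    An empty tuple means the whole input value was used.
    """
    paths = []
    pos = 0
    for depth in iteration_depths:
        paths.append(tuple(p[pos:pos + depth]))
        pos += depth
    return paths


# Two inputs iterated one level each, a third input passed whole.
print(map_output_path((3, 5), [1, 1, 0]))
```

The rule only looks at how deep each path is, never at the values themselves, which is why it can be evaluated at query time without enumerating every element.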

Page 75:

Architecture provenance-active processors

(Figure components: Taverna workflow engine with inputs and outputs; provenance manager; provenance events; provenance information repository; lineage query interface: lin( P:Y, , Psel, E(D)); external services; p-active API)

1. Common content:
– processor execution details
– binding of input/output variables to values
– completion status

2. Optional content for provenance-active processors:
– explicit output → input dependency assertions:
let I, O be the input, resp. output variables set
depends(Y, X[p], <depType>), X ∈ I, Y ∈ O
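A hedged Python sketch of what a provenance-active processor might report. The `Depends` record mirrors the `depends(Y, X[p], <depType>)` assertion above, but the class names, the `report` method and the dependency type `"selection"` are all hypothetical; the transcript does not show the actual p-active API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Depends:
    """One assertion: output_var depends on input_var at input_path."""
    output_var: str
    input_var: str
    input_path: tuple
    dep_type: str


class ProvenanceCollector:
    """Hypothetical stand-in for the receiving end of the p-active API."""

    def __init__(self):
        self.assertions = []

    def report(self, assertion):
        self.assertions.append(assertion)


def active_min(xs, collector):
    """l(s) -> s processor that also reports which input element it selected."""
    i = min(range(len(xs)), key=xs.__getitem__)
    collector.report(Depends("Y", "X", (i,), "selection"))
    return xs[i]


collector = ProvenanceCollector()
value = active_min([7, 3, 9], collector)
print(value, collector.assertions)
```

A passive `min` would force the lineage of its output to be the whole input list; the reported assertion narrows it to one element.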

Page 76:

Ongoing work

• Experimental evaluation:
– to what extent is granularity a real practical problem?
– Quantify provenance friendliness by analysing a large collection of workflows from myExperiment
– Quantify available improvements (i.e. by refactoring)
• Compare collection management in Taverna with other workflow models
– can we successfully exchange provenance graphs?
• Integration of the provenance service with the new version of Taverna
– to be released before end of year