A graph model for data and workflow provenance

Umut Acar, Peter Buneman, James Cheney, Natalia Kwasnikowska,

Jan van den Bussche, & Stijn Vansummeren

TaPP 2010

Provenance in ...

• Databases

• Mainly for (nested) relational model

• Where-provenance ("source location")

• Lineage, why ("witnesses")

• How/semiring model

• Relatively formal

• Workflows

• Many different systems

• Many different models

• (converging on OPM?)

• Graphs/DAGs

• Relatively informal

?????

This talk

• Relate database & workflow "styles"

• Develop a common graph formalism

• Need a common, expressive language that

• supports many database queries

• describes some (simple) workflows

Previous work

• Dataflow calculus (DFL), based on nested relational calculus (NRC)

• Provenance "run" model by Kwasnikowska & Van den Bussche (DILS 07, IPAW 08)

• "Provenance trace" model for NRC

• by (Acar, Ahmed & Cheney '08)

• Open Provenance Model (bipartite graphs)

• (Moreau et al. 2008-9), used in many WF systems

NRC/DFL background

• A very simple, functional language:

• basic functions +, *,... & constants 0,1,2,3...

• variables x,y,z

• pair/record types (A:e,...,B:e), πA (e)

• collection (set) types

• {e,...},  e ∪ e,  {e | x in e'},  ∪e

An example

• Suppose R = {(1,2,3), (4,5,6), (9,8,7)}

sum { x * y | (x,y,z) in R, x < y}

= sum { x * y | (x,y,z) in {(1,2,3), (4,5,6)}}

= sum {1 * 2, 4 * 5}

= sum {2,20}

= 22
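The evaluation steps above can be replayed with ordinary set comprehensions. The following is a minimal sketch in Python (our choice for illustration only, not the language used in the talk), with the same R as on the slide:

```python
# The relation R from the slide, as a set of triples.
R = {(1, 2, 3), (4, 5, 6), (9, 8, 7)}

# { x * y | (x,y,z) in R, x < y }: keep the triples with x < y, then multiply.
products = {x * y for (x, y, z) in R if x < y}

# Apply the primitive function sum to the resulting set.
total = sum(products)

print(sorted(products), total)
```

Only (1,2,3) and (4,5,6) pass the test x < y, so the products are 2 and 20 and the total is 22, matching the last line of the derivation.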

Another example

• In DFL, built-in functions / constants can be whole programs & files,

• as in Provenance Challenge 1 workflow:

let WarpParams := {align_warp(img,hdr)
                   | (img,hdr) in Inputs} in
let Reslices := {reslice(wp)
                 | wp in WarpParams} in
softmean(Reslices)
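The shape of this workflow can be sketched in Python. The three functions below are hypothetical stand-ins that only build descriptive strings; the real align_warp, reslice and softmean are external programs, and the input names are made up for illustration.

```python
# Hypothetical stand-ins for the external programs; they only build labels.
def align_warp(img, hdr):
    return f"warp({img},{hdr})"

def reslice(wp):
    return f"reslice({wp})"

def softmean(reslices):
    # Sort so the result does not depend on set iteration order.
    return "softmean(" + ",".join(sorted(reslices)) + ")"

# Inputs is a set of (image, header) pairs; these names are invented.
inputs = {("img1", "hdr1"), ("img2", "hdr2")}

# let WarpParams := {align_warp(img,hdr) | (img,hdr) in Inputs} in ...
warp_params = {align_warp(img, hdr) for (img, hdr) in inputs}
# let Reslices := {reslice(wp) | wp in WarpParams} in ...
reslices = {reslice(wp) for wp in warp_params}
# softmean(Reslices)
result = softmean(reslices)

print(result)
```

Each let binds the result of a set comprehension, so the whole workflow is an expression of the same kind as the database-style query in the previous example.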

Goal: Define "provenance graphs" for DFL

http://www.flickr.com/photos/schneertz/679692806/

First step: values

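[Figure: the kinds of value node v: a constant c; a record <> with edges A1, ..., An to values; a set {} with elem edges to values; or a copy of another value v.]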

Example value

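[Figure: an example value graph: a set node {} with elem edges to two record nodes <>, each with A and B edges, over the constants 1, 2 and 3.]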
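One possible encoding of value nodes is sketched below in Python. The Node class, its helper functions, and the particular example value are assumptions made for illustration; the slide's exact example could not be recovered from the transcript.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                  # a constant, "<>", "{}" or "copy"
    edges: list = field(default_factory=list)   # list of (edge_label, child Node)

def const(c):
    return Node(str(c))

def record(**fields):
    # One edge per field name A1, ..., An.
    return Node("<>", [(name, v) for name, v in fields.items()])

def set_of(*elems):
    # One elem edge per element.
    return Node("{}", [("elem", v) for v in elems])

def copy(v):
    return Node("copy", [("", v)])

def to_python(v):
    """Read back the plain value denoted by a value node (integer constants only)."""
    if v.label == "<>":
        return tuple((name, to_python(child)) for name, child in v.edges)
    if v.label == "{}":
        return frozenset(to_python(child) for _, child in v.edges)
    if v.label == "copy":
        return to_python(v.edges[0][1])
    return int(v.label)

# A hypothetical value {<A:1,B:2>, <A:1,B:3>} in which both records share the node for 1.
one = const(1)
example = set_of(record(A=one, B=const(2)), record(A=one, B=const(3)))
print(to_python(example))
```

Because nodes are shared by reference, the structure is a graph rather than a tree: the constant 1 is a single node with two incoming A edges.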

Next step: evaluation nodes ("process")

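• Constants, primitive functions: nodes c and f, with argument edges labeled 1, ..., n

• Variables & temporary bindings: nodes x and let x, the latter with head and body edges

Pairing

• Record building: node <> with edges labeled A1, ..., An

• Field lookup: node πA

Conditionals

• Node if, with a test edge and either a then edge or an else edge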

Note: Only taken branch is recorded
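A minimal Python sketch of this recording rule follows. The dictionary layout and the name eval_if are hypothetical and are not taken from the authors' implementation.

```python
def eval_if(test, then_thunk, else_thunk):
    """Evaluate a conditional and record only the branch that actually ran."""
    t = test()
    if t:
        branch, value = "then", then_thunk()
    else:
        branch, value = "else", else_thunk()
    # The recorded node has a test entry and exactly one branch entry.
    return {"node": "if", "test": t, branch: value, "value": value}

# The test 1 < 2 is true, so only the then branch is evaluated and recorded.
node = eval_if(lambda: 1 < 2, lambda: "a", lambda: "b")
print(node)
```

The untaken branch leaves no trace in the recorded node, which matters later when a changed input would flip the test.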

Sets: basic operations

• Empty set

• Singleton: node {} with one argument e

• Union: node ∪ with argument edges 1 and 2

Sets: complex operations

• Flattening: node ∪ with one argument e

• Iteration: node for x with head and body edges

Provenance graphs

• are graphs with "both value and evaluation structure"

A bigger example

[Figure: provenance graph for a larger example; its node labels did not survive extraction. The following slides highlight parts of the same graph.]

Value structure

• Surviving labels of the highlighted value nodes: {}, <>, C, T, F, 1, 2

Input values

Return value

Expression structure

• Surviving labels of the highlighted evaluation nodes: let R, let S, for x, for y, if, =, fst, snd, empty, +, {}


Building provenance graphs

• is complicated

• Here we'll use high-level "graph rewrite rule" formalism

• Mostly because it is nicer to look at than formal version

[Figures: graph rewrite rules, one per construct; the drawings did not survive extraction. The rules shown cover:]

• constants c, and primitive functions f applied to values v1, ..., vn, yielding f(v1,...,vn)

• let x (head, body), using copy edges

• field lookup πAi on a record <> with fields A1, ..., An, copying vi

• record building <> from values v1, ..., vn

• if (test, then, else) with test True or False, copying the result of the branch taken

• empty? on a set {}, yielding True or False

• the empty set ∅, singleton {}, and union ∪ over elem edges

OK, take a deep breath!

[Figures: rewrite rules for iteration (for x with head and body edges; the body is instantiated once per element v1, ..., vn of the head set, with x a copy of that element) and for flattening ∪ of a set of sets.]

An example

[Figure sequence: step-by-step rewriting of for x over the set {1, 2} with body x + 1. The body is instantiated once per element, each x becomes a copy (C) of its element, and the two + nodes evaluate to 2 and 3.]
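The same iteration can be sketched in Python. The dictionary layout and the names eval_for and plus_one are hypothetical, and the sketch records far less than the full rewrite rules do.

```python
def eval_for(head, body):
    """Run body once per element of head, recording each instance."""
    instances = []
    for v in sorted(head):                      # sorted only for stable output
        # In each instance, the variable x is a copy of the element v.
        x = {"node": "x", "copy_of": v, "value": v}
        instances.append(body(x))
    result = {inst["value"] for inst in instances}
    return {"node": "for x", "head": head, "body": instances, "value": result}

def plus_one(x):
    # The body x + 1, recorded as a + node over x and the constant 1.
    return {"node": "+",
            "args": [x, {"node": "c", "value": 1}],
            "value": x["value"] + 1}

g = eval_for({1, 2}, plus_one)
print([b["value"] for b in g["body"]])
```

The recorded graph has one + node per element of the head set, so the expression structure grows with the data, as in the figure sequence.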

Graphs can "lie" (inconsistency)

[Figures: three inconsistent graphs: (1) a + node with inputs 2 and 2 but result 5; (2) an if node whose test is True but which records an else branch, copying 2; (3) a larger for x graph over set nodes with constants 1, 2, 3, 4.]

• The last graph is "locally" but not "globally" consistent
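A local consistency check for the first two kinds of lie can be sketched as follows. The node encoding and the name locally_consistent are assumptions; the sketch checks one node at a time and does not attempt the global notion of consistency.

```python
import operator

# Primitive functions the checker knows how to recompute (an illustrative subset).
PRIMS = {"+": operator.add, "*": operator.mul}

def locally_consistent(node):
    """Check one node: does its recorded value match its recorded inputs?"""
    kind = node["node"]
    if kind in PRIMS:
        a, b = (arg["value"] for arg in node["args"])
        return PRIMS[kind](a, b) == node["value"]
    if kind == "if":
        # The recorded branch must be the one selected by the test value.
        taken = "then" if node["test"]["value"] else "else"
        return taken in node and node[taken]["value"] == node["value"]
    return True  # constants are consistent on their own

two = {"node": "c", "value": 2}
good_plus = {"node": "+", "args": [two, two], "value": 4}
bad_plus = {"node": "+", "args": [two, two], "value": 5}
bad_if = {"node": "if", "test": {"node": "c", "value": True},
          "else": two, "value": 2}

print(locally_consistent(good_plus), locally_consistent(bad_plus),
      locally_consistent(bad_if))
```

A graph in which every node passes such a check can still fail to correspond to any real evaluation, which is the point of the "locally" versus "globally" distinction.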

Graph queries

• Many possible approaches

• In paper: some Datalog

• Maybe overkill, seems fragile

• In code: some "annotation propagation" traversals

• Seems to handle where, "explanations", "summaries"
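As an illustration of an annotation-propagation traversal, the sketch below pushes annotations from source nodes to everything that copies them, which gives a simple form of where-provenance. The copy_of encoding, the node names, and the function propagate are all hypothetical and are not the authors' code.

```python
def propagate(copy_of, annotations):
    """Push annotations from sources to every node reachable via copy edges.

    copy_of maps a node id to the node id it was copied from.
    annotations maps some source node ids to labels.
    """
    def source(n):
        # Follow copy edges back to the original node (guarding against cycles).
        seen = set()
        while n in copy_of and n not in seen:
            seen.add(n)
            n = copy_of[n]
        return n

    result = dict(annotations)
    for n in copy_of:
        s = source(n)
        if s in annotations:
            result[n] = annotations[s]
    return result

# out copies x1, which copies the input node r1; y copies an unannotated node r2.
copy_of = {"x1": "r1", "out": "x1", "y": "r2"}
annotated = propagate(copy_of, {"r1": "R[0].A"})
print(annotated)
```

Only copy edges carry the annotation here; a node computed by a primitive function such as + would get no where-provenance under this scheme.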

Explaining

[Figure sequence on the larger example graph; labels did not survive extraction.]

• Note: Smallest consistent subgraph (NOT transitive closure!)


Summarizing

[Figures on the larger example graph; the only surviving labels are + 2, {} and 1.]

Graphs are partially "replayable"

• If we change a value node, can try to "readjust" to recover consistency

• Formalized in (Acar, Ahmed, Cheney 08)

[Figure sequence: a + node with inputs 2 and 2 and result 4; one input is changed to 17 and the result is readjusted to 19. An if node recorded only its else branch under a False test; when the test becomes True, replay is "Stuck!" because no then branch was recorded.]
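The two situations can be sketched in Python. This is an illustrative recomputation over a hypothetical node encoding, not the formal readjustment of (Acar, Ahmed, Cheney 08).

```python
class Stuck(Exception):
    """Raised when replay needs a branch that the graph never recorded."""

def readjust(node):
    """Recompute recorded values bottom-up after an input was changed."""
    kind = node["node"]
    if kind == "c":
        return node["value"]
    if kind == "+":
        node["value"] = sum(readjust(a) for a in node["args"])
        return node["value"]
    if kind == "if":
        taken = "then" if readjust(node["test"]) else "else"
        if taken not in node:
            raise Stuck(f"no recorded {taken} branch")
        node["value"] = readjust(node[taken])
        return node["value"]
    raise ValueError(f"unknown node kind: {kind}")

# 2 + 2 = 4; change the second input to 17 and readjust.
plus = {"node": "+",
        "args": [{"node": "c", "value": 2}, {"node": "c", "value": 2}],
        "value": 4}
plus["args"][1]["value"] = 17
readjust(plus)

# An if that recorded only its else branch; flip the test to True.
cond = {"node": "if", "test": {"node": "c", "value": False},
        "else": {"node": "c", "value": 2}, "value": 2}
cond["test"]["value"] = True
try:
    readjust(cond)
    stuck = False
except Stuck:
    stuck = True

print(plus["value"], stuck)
```

The + node can be repaired from the graph alone, while the conditional cannot, because the expression for the untaken branch is not part of the recorded graph.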

Implementation in Haskell

• Summarized in paper, full code on request

• roughly 250 LOC for basic evaluator

• another 300 for graphviz translation, basic queries, examples

• Point?

• No claim of efficiency/scalability but easy to understand, experiment

• Elucidates some tricky details that pictures hide

• Similar "lightweight modeling" might be valuable for understanding/relating other WF/DB models

Related work

• This work synthesizes/rearranges ideas from several previous works & "folklore"

• traces (Acar, Ahmed, Cheney 2008)

• runs (Kwasnikowska, van den Bussche, DILS 2007, IPAW 2008)

• OPM graphs (Moreau et al. IPAW 2008 etc.)

• and many workflow systems

• More can be done to relate DB & workflow models

Future work

• This is work in progress

• Next steps:

• Extending to understand/model other workflow features

• Better grasp of "real" queries and features needed

• Implementa(tion|ability)?

• Optimization?

Conclusions

• DB & WF provenance have much in common

• We develop common graph model

• with both intuitive & precise presentations

• Still much to do to relate and integrate DB & WF models

• let alone integrate models at scale in real systems
