where did i go wrong? explaining errors in process models

Where did I go wrong? Explaining errors in process models

Niels Lohmann

Verification of processes and services

- more aspects and domains = new languages and checks - domain-specific approaches are not flexible - moving target

�2

WS-Adressing

WSCIWS-CDL

WS-TXWS-C

WS-AT

WS-TX

WS-BPEL

WSDM

WSRM

BPEL4People

WSRF

WSFLWS-Policy

WS-Routing

Model checking

general purpose verification approach: 1. formalize model and specification* 2. push a button

�3

* can behidden from the user

Effectiveness and efficiency- model checking works in reality - successful applications in many domains !

!

!

!

!

- “verify while you model”�4

Diagnosis

- in case of error: outputs target state and produce a witness path

- describes how target state can be reached

- operational semantics: can be simulated

�5

witness path

target state

Diagnosis: the bad

- paths can become very long - length correlates withsize of the model

- reports all events equally: disregarding importance

�6

Reasons for useless paths

�7

detours depth-first search

indisputable parts bootstrapping

interleavings concurrency

Running example

�8

p1

p2 p3

p4

p5

p6 p7

p8

p9

p10

p11 p12

p13

t1

t2

t3

t4 t5

t6

t7

t8 t9 t10

t11

t12

t13

t14 p14

on an edge between two nodes. Tokens move through the process as a task or gatewayexecutes, taking the process from one state to another state in the usual way.

fgh

ji

k

abcde

A !

"

#

$

%

&

'

()

*

+

,

!

"

#

$$$

%&'(

%&')

*+,-'(

+,-')

.

/

%

0

1

2

3

4

Fig. 3: Translation of a task with disjoint input and output sets (left) into the correspond-ing workflow graph (center) and Petri net patterns (right).

In the center of Fig. 3, we see the translation into a workflow graph [2, 5], whichis a control-flow graph containing only gateways and tasks. To the right, we see theresulting Petri net. In general, input and output sets can overlap, which would leadto non-free-choice Petri nets as a result of the translation [12]. However, none of thesyntactically valid process models contained in our test set used overlapping inputs oroutput sets, i.e., the translation will only return free-choice nets in our case study. Thismakes it possible to benefit from fast analysis techniques for free-choice Petri nets, seefor example Sect. 4. Furthermore, users of the tool can specify which input set activateswhich output set, but this information was not provided in any of the models. For thetranslation, we therefore assumed that each input set can potentially activate each outputset. Two di↵erent translations into workflow graphs and Petri nets were implemented,although the Petri nets could also be directly obtained from the workflow graphs by awell-known construction [2]. The Petri net models are available at http://www.service-technology.org/soundness in PNML format.

2.3 Soundness

Figure 4 shows a workflow graph without any tasks as it occurs in the middle part of theprocess in Fig. 2 and to which we added a start and an end event. This process modelcontains a lack of synchronization error as well as a local deadlock, which are not soeasy to spot in the first place.

F1

J1

M1

M2

Fig. 4: Workflow graph with deadlock and lack of synchronization errors.

A local deadlock is a reachable state s of the process that has a token on an incom-ing edge e of an AND-join such that each state that is in turn reachable from s also

6

lack of synchronization

Reduction: obvious parts- assumption: progress - classification of transitions* - only report decisions

�9* not just XOR-gateways!

p1

p2 p3

p4

p5

p6 p7

p8

p9

p10

p11 p12

p13

t1

t2

t3

t4 t5

t6

t7

t8 t9 t10

t11

t12

t13

t14 p14

Reduction: obvious parts

�10

t1 t2 t9 t10 t11 t12 t14 t8 t2 t3 t4 t5

p1

p2 p3

p4

p5

p6 p7

p8

p9

p10

p11 p12

p13

t1

t2

t3

t4 t5

t6

t7

t8 t9 t10

t11

t12

t13

t14 p14


�10

t1 t2 t9 t10 t11 t12 t14 t8 t2 t3 t4 t5

“down” “down” “up”

p1

p2 p3

p4

p5

p6 p7

p8

p9

p10

p11 p12

p13

t1

t2

t3

t4 t5

t6

t7

t8 t9 t10

t11

t12

t13

t14 p14


�11

Table 1. Paths from the checks for local deadlocks

library A B1 B2 B3 C

avg. path length before / after 17.51 / 1.83 17.52 / 2.11 16.06 / 1.54 20.34 / 1.67 13.40 / 2.30

max. path length before / after 53 / 8 66 / 7 56 / 6 54 / 5 21 / 3

sum of path lengths before / after 1699 / 178 1419 / 171 1349 / 129 1688 / 139 134 / 23

reduction 89.52 % 87.95 % 90.44 % 91.77 % 82.84 %

Table 2. Paths from the checks for lack of synchronization





reduction 89.71 % 93.70 % 94.38 % 94.89 % 85.15 %

Information flow security. Furthermore, the same business process models were usedin a recent report [12] on information flow security. In this case study, noninterfer-ence [13] was verified. This correctness criterion ensures that decisions from a securedomain cannot be reproduced by investigating public runtime information of the busi-ness process. To perform this check, each business process model needed to be checkedseveral times: For each participant (i.e., swimlane of the process), one check is required.In that case study, 4050 errors were reported, yielding 4050 paths to investigate.4

Table 3 summarizes the reduction results for the paths. Again, we can report areduction between 76–87 %. The maximal reduced path of the whole case study consistsonly of 7 transitions, whereas it was 95 transitions before the reduction. On average, notmore than 2.79 transitions are reported per detected error.

The experiments report promising results. Though the reduced paths consist ofPetri net transitions, it can be easily translated back into the nomenclature of the orig-inal model. For each model translated from the IBM WebSphere Business Modelerinto a Petri net, a file was created that maps the Petri net nodes to a construct of theoriginal model, see [9]. Consequently, conflict transitions with names such as “deci-sion.s00002158.activate.s00000708” can be easily linked to the respective gateway withID s00002158 with its outgoing flow with ID s00000708.5

4 Further reduction: remove spurious conflicts

4.1 Motivation and formalization

In the previous section, we showed how paths to errors in business process models can bereduced by only reporting conflict transitions. This reduction decided, for each marking

4 The original models were not designed with noninterference in mind. However, the authors of[12] decided to use the IBM processes as case study to investigate their algorithms.

5 Note that the original business process models of the case studies were anonymized, yieldingartificial names such as “s00000708” rather than “credit check OK”.

Table 1. Paths from the checks for local deadlocks





reduction 89.52 % 87.95 % 90.44 % 91.77 % 82.84 %

Table 2. Paths from the checks for lack of synchronization





reduction 89.71 % 93.70 % 94.38 % 94.89 % 85.15 %

Information flow security. Furthermore, the same business process models were usedin a recent report [12] on information flow security. In this case study, noninterfer-ence [13] was verified. This correctness criterion ensures that decisions from a securedomain cannot be reproduced by investigating public runtime information of the busi-ness process. To perform this check, each business process model needed to be checkedseveral times: For each participant (i.e., swimlane of the process), one check is required.In that case study, 4050 errors were reported, yielding 4050 paths to investigate.4

Table 3 summarizes the reduction results for the paths. Again, we can report areduction between 76–87 %. The maximal reduced path of the whole case study consistsonly of 7 transitions, whereas it was 95 transitions before the reduction. On average, notmore than 2.79 transitions are reported per detected error.

The experiments report promising results. Though the reduced paths consist ofPetri net transitions, it can be easily translated back into the nomenclature of the orig-inal model. For each model translated from the IBM WebSphere Business Modelerinto a Petri net, a file was created that maps the Petri net nodes to a construct of theoriginal model, see [9]. Consequently, conflict transitions with names such as “deci-sion.s00002158.activate.s00000708” can be easily linked to the respective gateway withID s00002158 with its outgoing flow with ID s00000708.5

4 Further reduction: remove spurious conflicts

4.1 Motivation and formalization

In the previous section, we showed how paths to errors in business process models can bereduced by only reporting conflict transitions. This reduction decided, for each marking

4 The original models were not designed with noninterference in mind. However, the authors of[12] decided to use the IBM processes as case study to investigate their algorithms.

5 Note that the original business process models of the case studies were anonymized, yieldingartificial names such as “s00000708” rather than “credit check OK”.

Table 3. Paths from the checks for noninterference





reduction 76.87 % 81.53 % 87.16 % 82.11 % 79.29 %

that activates a transition, whether conflicting transitions are also activated. This checkis local in the sense that it is not checked whether those transitions that were not taken inthe decisions actually could have avoided the next conflict transition on the path.

We can formalize this idea as follows:

Definition 5 (Spurious conflict). Let ⇡ = t1 · · · tn be a reduced path and, for 0 i <

n be mi the marking that is reached from m0 by firing the first i transitions of ⇡. Inmarking mi, transition ti is a spurious conflict iff, for all t 2 [ti] \ T with t 6= ti andmi

t�! holds: mit�! m

0i and for all paths beginning with m

0i, marking mi+1 is eventually

reached, activating the next conflict transition ti+1 on ⇡.

Intuitively, a transition ti on a reduced path ⇡ is a spurious conflict iff every tran-sition t in conflict to ti eventually reaches the marking mi+1 which enables the nexttransition ti+1 on path ⇡. In this case, choosing any transition from the conflict clus-ter [ti] will eventually enable the next conflict on the path to the goal state. Consequently,reporting the spurious conflict ti is of little help to the modeler to understand the erroritself.

The check for spurious transitions defined above can be straightforwardly be imple-mented using a model checker.6 We integrated this check as postprocessing step afterreducing the paths as described in the previous section. Note that executing a modelchecker can be very time and memory consuming. However, even if a check is notfinished with a reasonable amount of resources, we just failed to proof whether a conflictis spurious and can continue with the investigation of the next transition. That said, thepostpocessing can be aborted at any time — any intermediate result is still correct.

4.2 Experimental resultsWe applied the reduction of spurious conflicts to the case studies described in theprevious section. Table 4–6 summarize the results. In all three experiments, the pathscould be further reduced by 50–86%. Now, in all experiments, at most two transitionsare reported as to reach the error. All other transitions are either nonconflicting or arespurious conflicts for which any resolution eventually reaches the next conflict on thepath. Note that in some cases, the check for spurious conflicts has been aborted aftermore than 2 GB of memory were consumed. In these cases, the conflict was kept in thepath and the check proceeded with the next conflict.

5 Combining paths and concurrency

5.1 Motivation and formalizationSo far, we focused on reducing paths by removing any transitions whose firing providesno information to the modeler on why the goal state was actually reached. Thereby, we

6 We check whether N with initial marking m0i satisfies the CTL formula ' = AFmi+1.

Reduction: spurious decisions

- some decisions determine others - often occurs in non-free choice models - can be model checked

�12

p1p3

p2

p4

p5

p6 p1p3

p5

p6

Reduction: spurious decisions

�13

Table 4. Reduced paths from the checks for local deadlocks





reduction 50.56 % 68.42 % 62.79 % 75.54 % 60.87 %

aborted checks 1 0 0 0 0

Table 5. Reduced paths from the checks for lack of synchronization





reduction 72.97 % 54.55 % 79.27 % 84.42 % 86.79 %


could exploit the Petri net structure to calculate conflict clusters to identify possibleconflict transitions. This allowed for a quick check whether a transition is actually aconflict.

However, we still considered paths as a sequences of transitions leading to the goalstate. As discussed earlier, this sequence may be an arbitrary linearization of originallyconcurrent behavior. Therefore, communicating paths to the modeler — for instanceby means of animation or simulation — still uses this arbitrary and hence unintuitiveordering. This may be especially confusing if the underlying process spans severalcomponents (e.g., participants of a choreography or lanes) and the path constantlyswitches between actions of different components.

To this end, this section aims at exploiting the concurrency of the Petri net model andto use it to reorganize paths. We thereby try to undo the arbitrary ordering and to providepartially-ordered paths, called distributed runs in the literature [7]. As sketched earlier,two transitions t1 and t2 may fire concurrently in a marking if they do not disable eachother; that is, any ordering of t1 and t2 are possible. The definition of Petri nets furtherensure that firing any order of concurrent transitions yield the same marking. This hasone interesting effect: by depicting concurrent transitions as concurrent (i.e., unordered),a distributed run implicitly expresses all possible orderings of these transitions.

Before we continue, we formalize the concept of a distributed run. The underlyingstructure of such a distributed run is a causal net:

Definition 6 (Causal net). A causal net is a Petri net C = [P, T, F ] without initialmarking such that (1) for each place p holds: |•p| 1 and |p•| 1, (2) the transitiveclosure F

+ of the flow relation F is irreflexive, and (3) any node has only finitely manypredecessors with respect to F

+.

Intuitively, a causal net is (1) conflict free and begins and ends with places, (2) isacyclic, and (3) the prefix of any element is finite. A causal net does not have an initialmarking — its places with empty preset represent initially marked places. This becomesclear in the definition of a distributed run of a Petri net N , defined as follows:

Table 4. Reduced paths from the checks for local deadlocks





reduction 50.56 % 68.42 % 62.79 % 75.54 % 60.87 %


Table 5. Reduced paths from the checks for lack of synchronization





reduction 72.97 % 54.55 % 79.27 % 84.42 % 86.79 %


could exploit the Petri net structure to calculate conflict clusters to identify possibleconflict transitions. This allowed for a quick check whether a transition is actually aconflict.

However, we still considered paths as a sequences of transitions leading to the goalstate. As discussed earlier, this sequence may be an arbitrary linearization of originallyconcurrent behavior. Therefore, communicating paths to the modeler — for instanceby means of animation or simulation — still uses this arbitrary and hence unintuitiveordering. This may be especially confusing if the underlying process spans severalcomponents (e.g., participants of a choreography or lanes) and the path constantlyswitches between actions of different components.

To this end, this section aims at exploiting the concurrency of the Petri net model andto use it to reorganize paths. We thereby try to undo the arbitrary ordering and to providepartially-ordered paths, called distributed runs in the literature [7]. As sketched earlier,two transitions t1 and t2 may fire concurrently in a marking if they do not disable eachother; that is, any ordering of t1 and t2 are possible. The definition of Petri nets furtherensure that firing any order of concurrent transitions yield the same marking. This hasone interesting effect: by depicting concurrent transitions as concurrent (i.e., unordered),a distributed run implicitly expresses all possible orderings of these transitions.

Before we continue, we formalize the concept of a distributed run. The underlyingstructure of such a distributed run is a causal net:

Definition 6 (Causal net). A causal net is a Petri net C = [P, T, F ] without initialmarking such that (1) for each place p holds: |•p| 1 and |p•| 1, (2) the transitiveclosure F

+ of the flow relation F is irreflexive, and (3) any node has only finitely manypredecessors with respect to F

+.

Intuitively, a causal net is (1) conflict free and begins and ends with places, (2) isacyclic, and (3) the prefix of any element is finite. A causal net does not have an initialmarking — its places with empty preset represent initially marked places. This becomesclear in the definition of a distributed run of a Petri net N , defined as follows:

Table 6. Reduced paths from the checks for noninterference





reduction 64.58 % 70.59 % 76.20 % 75.34 % 82.86 %


Definition 7 (Distributed run). Let N = [PN , TN , FN ,m0] be a Petri net, C =[PC , TC , FC ] be a causal net, and � ✓ (PC ⇥PN )[ (TC ⇥TN ) be a mapping. Furtherassume, without any loss of generality, that m0(p) 1 for all p 2 PN . The pair [C,�]is a distributed run of N iff: (1) for all pC 2 PC with •

pC = ; holds: m0(�(pC)) = 1and (2) for each tC 2 TC with �(tC) = tN holds: � bijectively maps •

tC to •tN and

t

•C to t

•N .

A distributed run is a causal net C whose nodes are mapped to those of a Petri net,such that (1) those places of C without predecessors map to the initially marked placesof N and (2) the preset and postset of a transition of C bijectively maps to the presetand postset of the respective transition of N .

5.2 Translating paths into distributed runsIntuitively, we can translate a path into a distributed run by copying fired transitionswith their preset and postset to the distributed run and “glue” those places representingresources created by one transition and consumed by another transition.

In more detail, a path ⇡ of a Petri net N can be translated into a distributed run[C,�] as follows: First, add, for each initially marked place pN of N , a place pC to PC

and define �(pC) = pN . Then, for each firing m

tN��! m

0 in ⇡, (1) add a transition tC

to TC and define �(tC) = tN , (2) for each place pN 2 •tN , find a place pC of C with

�(pC) = pN and p

•C = ; and add an arc [pC , tC ] to FC , and (3) for each place pN 2 t

•N

add a place pC to PC and define �(pC) = pN and add an arc [tC , pC ] to FC .As paths are acyclic, the created distributed run is finite by definition. Furthermore,

paths are conflict-free (i.e., every intermediate marking has exactly one successor) suchthat the created Petri net structure is indeed a causal net.

The translation into a distributed run now allows for a reasoning of the relationshipbetween occurrences of transitions on the path. We distinguish two cases: On the onehand, if there is a directed path between t1 and t2, then �(t1) was fired causally before�(t2). On the other hand, �(t1) and �(t2) were fired concurrently, if there exists neithera path from t1 to t2 nor from t2 to t1. In this case, the order on path ⇡ was arbitrary andshould not be reported as such to the modeler.

Example (cont.). Figure 2 depicts path ⇡ as distributed run. Note that the cycle in themodel is unfolded, yielding two copies of transition t2. Two places p6 without successorsmodel the target marking {p6 7! 2}. Furthermore, note transition t11 is displayedconcurrently to all transitions following t10.

Reduction: unorder transitions

- Petri nets have explicit locality - exploit to derive concurrency - helps to “distribute” actions to components - makes synchronization points (milestones) explicit

�14


�15

p1 t1 p2 t2 p3 t9 p9 t10p11 t12 p12 t14 p14 t8 p2 t2 p3 t3 p4 t4 p5 t5 p6

p10 t11 p6

p1

p2 p3

p4

p5

p6 p7

p8

p9

p10

p11 p12

p13

t1

t2

t3

t4 t5

t6

t7

t8 t9 t10

t11

t12

t13

t14 p14

t1 t2 t9 t10 t11 t12 t14 t8 t2 t3 t4 t5


�16

Summary- paths can be shortened and uncluttered - result is a partial order of important decisions - applicable to any verification goal

Open issues - error localization vs. explanation - cyclic behavior - How should a good diagnosis for $problem look like?

�17

Where did I go wrong? Explaining errors in process models

Niels Lohmann

where did i go wrong? explaining errors in process models

Education

library sum of path

length correlates

b1 b2 b3 c sum of path

obvious checks

b1 b2 b3 c reduction

checks domain

secure reduction

bad paths